Extracted page text includes annotations: page_text = page.get_text("text")

Extracted page text includes annotations (type FreeText)

When extracting text using:

page_text = page.get_text("text")

The text from annotations of type FreeText is included in the extracted page text.

Example workflow:

import pymupdf

with pymupdf.open(pdf_path) as doc:
    for page in doc:
        page_text = page.get_text("text")

The extracted page_text contains text that originates from FreeText annotations, even though this text is not part of the page content stream (/Contents).

Inspecting the raw page content confirms the annotation text is not present there:

xref_list = page.get_contents()
for xref in xref_list:
    stream = doc.xref_stream(xref)
    print(stream[:500])

The text appears to come from the annotation appearance stream (/Annots -> /AP), which get_text() seems to include.

Question

How can I extract page_text that includes only the PDF page text, without the text from FreeText annotations?

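Hi @bik123, welcome to the forum and thanks for your post.

I think the trick is to delete the FreeText annotations as you iterate through the pages, filtering for the annotation types you want to remove, and only then extract the text. The deletion happens in memory, so the PDF file itself is unchanged unless you save the document.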

This worked for me:

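import pymupdf

src = pymupdf.open(pdf_path)
for page in src:
    # collect the xrefs first, so the annotation list is not changed while iterating over it
    xrefs = [annot.xref for annot in page.annots(types=[pymupdf.PDF_ANNOT_FREE_TEXT])]
    for xref in xrefs:
        a = page.load_annot(xref)
        page.delete_annot(a)

    # with the FreeText annotations gone, extract the remaining page text
    text = page.get_text()
    print(text)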