Extracted page text includes annotations (type FreeText)
When extracting text using:
page_text = page.get_text("text")
The text from annotations of type FreeText is included in the extracted page text.
Example workflow:
import pymupdf
with pymupdf.open(pdf_path) as doc:
    for page in doc:
        page_text = page.get_text("text")
The extracted page_text contains text that originates from FreeText annotations, even though this text is not part of the page content stream (/Contents).
Inspecting the raw page content confirms the annotation text is not present there:
xref_list = page.get_contents()
for xref in xref_list:
    stream = doc.xref_stream(xref)
    print(stream[:500])
The text appears to come from the annotation appearance stream (/Annots -> /AP), which get_text() seems to include.
Question
How can I extract page_text that contains only the page's own text, without the text of FreeText annotations?