Any idea what is wrong with this PDF?

Jamie_Lemon · July 9, 2025, 1:01pm

Extracting text from it produces garbled output, e.g.:

import pymupdf

doc = pymupdf.open("anon-test.pdf")
page = doc[0]  # first page
text = page.get_text()  # plain-text extraction with the default flags
print(text)

I think this is a PDF problem! How can we check?

anon-test.pdf (90.4 KB)

HaraldLieder · July 9, 2025, 3:03pm

The fonts in this PDF are missing the back-translation information [visible glyph] ==> Unicode.
The default flags for text extraction try to work around this by returning the glyph number whenever the invalid-Unicode character � (U+FFFD) would otherwise be returned.

This often helps, but sometimes it only increases the confusion, as it does here.
Try page.get_text(flags=0) to confirm that you would normally see � characters.

In any case, there is no way to improve this situation, except by using OCR of course.
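A minimal sketch of that check, assuming the symptom is measured as the share of U+FFFD characters in the output of page.get_text(flags=0). The helper names replacement_ratio and page_replacement_ratio are made up for illustration and are not part of PyMuPDF.

```python
REPLACEMENT_CHAR = "\ufffd"  # the Unicode replacement character


def replacement_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that are U+FFFD."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    return sum(ch == REPLACEMENT_CHAR for ch in visible) / len(visible)


def page_replacement_ratio(path: str, page_number: int = 0) -> float:
    """Extract one page with flags=0 and measure how much of it is U+FFFD."""
    import pymupdf  # imported lazily so replacement_ratio works without it

    doc = pymupdf.open(path)
    try:
        # flags=0 switches off the glyph-number substitution described above
        text = doc[page_number].get_text(flags=0)
    finally:
        doc.close()
    return replacement_ratio(text)
```

Calling page_replacement_ratio("anon-test.pdf") would give a number between 0 and 1. A value near zero suggests the fonts map back to Unicode; where to draw the line for "too many" is a judgment call.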

Jamie_Lemon · July 9, 2025, 3:05pm

How can you detect this? Is there a PyMuPDF command you can run that gives you this info? Something like “validate PDF” or whatever.

HaraldLieder · July 9, 2025, 3:13pm

Unfortunately, the information need not be missing “universally”, i.e. for everything: it may be just one font out of many on the page that has the problem, or even worse, just one or a handful of glyphs out of many okay ones.

I am considering changing the default flags so that this auto-replacement is no longer attempted, which would make the situation detectable. I have already done that in PyMuPDF4LLM and probably should do it in general.
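Because the problem can be limited to a single font, a per-font count may be more telling than a page-wide one. The following is only a sketch: it assumes the "dict" output of page.get_text() with flags=0, where each span carries "font" and "text" entries. The names fffd_by_font and report are hypothetical helpers, not PyMuPDF functions.

```python
from collections import defaultdict

REPLACEMENT_CHAR = "\ufffd"


def fffd_by_font(page_dict: dict) -> dict:
    """Map font name -> (U+FFFD count, total character count) over all spans."""
    counts = defaultdict(lambda: [0, 0])
    for block in page_dict.get("blocks", []):
        for line in block.get("lines", []):  # image blocks carry no "lines"
            for span in line.get("spans", []):
                entry = counts[span["font"]]
                entry[0] += span["text"].count(REPLACEMENT_CHAR)
                entry[1] += len(span["text"])
    return {font: tuple(pair) for font, pair in counts.items()}


def report(path: str, page_number: int = 0) -> None:
    """Print the per-font U+FFFD counts for one page of a PDF."""
    import pymupdf  # imported lazily so fffd_by_font works without it

    doc = pymupdf.open(path)
    try:
        page_dict = doc[page_number].get_text("dict", flags=0)
    finally:
        doc.close()
    for font, (bad, total) in sorted(fffd_by_font(page_dict).items()):
        print(f"{font}: {bad} of {total} characters are U+FFFD")
```

A font where most characters come back as U+FFFD is a likely culprit; a font with only a few hits matches the "handful of glyphs" case.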

Jamie_Lemon · July 9, 2025, 3:23pm

Hmmm, when I check it in Adobe Acrobat I get this weird font name:
[Screenshot 2025-07-09 at 16.20.21]

Also, if I select, copy and paste the text from Adobe (or Preview), I get garbage too.

So it seems this is just a badly made PDF, right? (Or deliberately “bad” to prevent easy text copying.)
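The font names can also be listed from PyMuPDF rather than from a viewer. The sketch below assumes the back-translation information is the font's /ToUnicode entry and flags fonts whose dictionary lacks that key. This is a hint at best: a font without /ToUnicode can still extract fine through a standard encoding, and a font with one can still map to nonsense. fonts_without_tounicode and check_file are hypothetical names.

```python
def fonts_without_tounicode(doc, page) -> list:
    """Return (xref, basefont) for fonts on `page` that have no /ToUnicode key."""
    missing = []
    for font in page.get_fonts():
        # each entry is (xref, ext, type, basefont, name, encoding)
        xref, basefont = font[0], font[3]
        if xref <= 0:  # defensive: skip entries without a usable xref
            continue
        key_type, _value = doc.xref_get_key(xref, "ToUnicode")
        if key_type == "null":  # absent keys are reported as ("null", "null")
            missing.append((xref, basefont))
    return missing


def check_file(path: str, page_number: int = 0) -> list:
    """Open a PDF and run the check on a single page."""
    import pymupdf  # imported lazily so the function above can be tested alone

    doc = pymupdf.open(path)
    try:
        return fonts_without_tounicode(doc, doc[page_number])
    finally:
        doc.close()
```

check_file("anon-test.pdf") would return the fonts on the first page that carry no /ToUnicode entry; an empty list does not prove the text is extractable.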

HaraldLieder · July 9, 2025, 3:29pm

Exactly the right check. If a PDF viewer can successfully copy/paste text where we return rubbish, then, and only then, we have a problem.

Jamie_Lemon · July 9, 2025, 3:44pm

Thanks — Got it!
