Text extraction from this PDF produces garbled output, e.g.:
import pymupdf
doc = pymupdf.open("anon-test.pdf")
page = doc[0]
text = page.get_text()
print(text)
I think this is a PDF problem! How can we check?
anon-test.pdf (90.4 KB)
The fonts in this PDF lack the back-translation information [visible glyph] ==> Unicode.
The default text extraction flags try to work around this by returning the glyph number wherever the invalid-Unicode character � (U+FFFD) would otherwise be returned.
This often helps, but sometimes it only adds confusion, as it does here.
Try page.get_text(flags=0) to confirm that you would normally see � characters.
In any case, there is no way to improve this situation, except by using OCR of course.
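A minimal sketch of that check, assuming the `flags=0` call above is enough to switch off the glyph-number substitution. The helper names `replacement_ratio` and `check_page` are made up for illustration and are not part of PyMuPDF:

```python
REPLACEMENT = "\ufffd"  # U+FFFD, the "invalid Unicode" character


def replacement_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that are U+FFFD."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c == REPLACEMENT for c in chars) / len(chars)


def check_page(path: str, pno: int = 0) -> float:
    """Extract one page with flags=0 and measure how much of it is unmapped."""
    import pymupdf  # imported here so replacement_ratio works without PyMuPDF

    doc = pymupdf.open(path)
    try:
        # flags=0 disables the default substitution of glyph numbers
        raw = doc[pno].get_text(flags=0)
    finally:
        doc.close()
    return replacement_ratio(raw)
```

Calling `check_page("anon-test.pdf")` should return a value close to 1.0 if most of the page has no Unicode mapping, and 0.0 for a page that extracts cleanly. Where to put the threshold in between is a judgment call.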
How can you detect this? Is there a PyMuPDF command which you can run which gives you this info? Something like “validate PDF” or whatever.
Unfortunately, the missing information may not be “universal” in the sense of “missing for everything”: the problem may affect just one font out of many on the page or, even worse, just one or a handful of glyphs among many that are fine.
I am considering changing the default flags so that this auto-replacement is no longer attempted, which would make the situation detectable. I have already done that in PyMuPDF4LLM and probably should do it in general.
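Until then, a rough way to see which fonts are affected is to count � characters per font in the "dict" output. This is only a sketch under the same `flags=0` assumption; `unmapped_by_font` and `report` are hypothetical helper names, not PyMuPDF functions:

```python
from collections import Counter

REPLACEMENT = "\ufffd"  # U+FFFD, the "invalid Unicode" character


def unmapped_by_font(page_dict: dict) -> dict:
    """Map each font name to (unmapped characters, total characters)."""
    totals, bad = Counter(), Counter()
    for block in page_dict.get("blocks", []):
        for line in block.get("lines", []):  # image blocks have no "lines"
            for span in line.get("spans", []):
                font = span.get("font", "?")
                text = span.get("text", "")
                totals[font] += len(text)
                bad[font] += text.count(REPLACEMENT)
    return {font: (bad[font], totals[font]) for font in totals}


def report(path: str, pno: int = 0) -> dict:
    """Run the per-font count on one page of a PDF file."""
    import pymupdf  # imported here so unmapped_by_font works without PyMuPDF

    doc = pymupdf.open(path)
    try:
        # flags=0 so that unmapped glyphs show up as U+FFFD
        page_dict = doc[pno].get_text("dict", flags=0)
    finally:
        doc.close()
    return unmapped_by_font(page_dict)
```

A font with a non-zero first number has at least some glyphs without a Unicode mapping. Note the limit of this approach: it only finds missing mappings. A font whose mapping is present but wrong will not produce � and so will not be flagged.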
Hmmm, when I check it in Adobe Acrobat I get this weird font name:

Also, if I select, copy and paste the text from Adobe (or Preview), then I also get garbage.
So it seems this is just a badly made PDF, right? (Or deliberately “bad” to prevent easy text copying.)
Exactly the right check. If a PDF viewer can successfully copy/paste text where we return rubbish, then, and only then, we have a problem.
Thanks, got it!