I’m trying to extract text from this PDF (https://openreview.net/pdf?id=g90RNzs8wX) using pymupdf4llm.to_markdown(pdf_path). Is there a way to fix a font error? Thanks!
Interesting. I see the error too, I think on page 26:
[========================================e=RuntimeError('code=4: no font file for digest')
I was running the following command:
md_text = pymupdf4llm.to_markdown("1522_Unifying_Unsupervised_Gra.pdf", page_chunks=False, extract_words=False, show_progress=True)
If I extract that page, it works (see my 1522_Unifying_Unsupervised_Gra-edit.pdf file).
@HaraldLieder What do you think is “wrong” with page 26 here?
1522_Unifying_Unsupervised_Gra-26.pdf (720.9 KB)
1522_Unifying_Unsupervised_Gra-edit.pdf (1.0 MB)
Also @eamag, welcome to the forum and thanks for your post!
This is caused by an upstream (MuPDF) problem. Recent versions of PyMuPDF4LLM make active use of MuPDF’s advanced detection of “faked” bold text, that is, text written with a standard (non-bold) font that is made to appear bold by writing the same text twice with a small displacement.
This algorithm is quite complex and only works for non-Type 3 fonts. The error you report currently happens because a check for text in a Type 3 font is missing.
A MuPDF bug report has already been submitted.
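Until the upstream fix lands, one possible stopgap is to convert the document page by page and fall back to plain text extraction for any page that fails. The following is only a sketch, not an official workaround. It assumes the failure surfaces as a RuntimeError (as in the log above), that a recent PyMuPDF is installed so that `import pymupdf` works, and that plain `page.get_text()` does not hit the same MuPDF bug, which I have not verified. The function names `convert_pages_tolerantly` and `to_markdown_with_fallback` are made up for this example.

```python
from typing import Callable, List, Tuple


def convert_pages_tolerantly(
    page_count: int,
    convert_page: Callable[[int], str],
    fallback: Callable[[int], str],
) -> Tuple[str, List[int]]:
    """Convert pages one at a time; use `fallback` for pages raising RuntimeError.

    Returns the joined text and the 0-based numbers of the pages that failed.
    """
    parts: List[str] = []
    failed: List[int] = []
    for pno in range(page_count):
        try:
            parts.append(convert_page(pno))
        except RuntimeError:
            # Assumption: the MuPDF font error arrives as RuntimeError.
            failed.append(pno)
            parts.append(fallback(pno))
    return "\n\n".join(parts), failed


def to_markdown_with_fallback(pdf_path: str) -> Tuple[str, List[int]]:
    """Hypothetical wrapper: Markdown where possible, plain text otherwise."""
    # Imported lazily so the helper above works without PyMuPDF installed.
    import pymupdf
    import pymupdf4llm

    doc = pymupdf.open(pdf_path)
    try:
        return convert_pages_tolerantly(
            doc.page_count,
            # `pages` takes a list of 0-based page numbers.
            lambda pno: pymupdf4llm.to_markdown(doc, pages=[pno]),
            # Plain text fallback; Markdown formatting is lost on these pages.
            lambda pno: doc[pno].get_text(),
        )
    finally:
        doc.close()
```

Calling `to_markdown_with_fallback("1522_Unifying_Unsupervised_Gra.pdf")` should then return the text together with the list of pages that needed the fallback, so you can see which pages lost their Markdown formatting.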