Pymupdf4llm: byte order marks in TOC prevent recognition of the document structure

soelderer · October 2, 2025, 6:31am

Hey,

I’ve already posted this on GitHub, but then saw the recommendation to use the forum. I’m extracting this paper to markdown like this:

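import pymupdf
import pymupdf4llm

doc = pymupdf.open(absolute_path)  # absolute_path: path to the PDF file
headers = pymupdf4llm.TocHeaders(doc)  # take header levels from the document's TOC
text = pymupdf4llm.to_markdown(doc, hdr_info=headers)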

I noticed that the titles in TocHeaders start with a byte order mark (U+FEFF, shown in Python as '\ufeff'): '\ufeffEffects of open-label placebos across populations and outcomes: an updated systematic review and meta-analysis of randomized controlled trials'. This prevents the document structure from being recognised, because title.startswith(text) fails.
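
To illustrate the mismatch, here is a minimal sketch in plain Python. The strings are shortened and the variable names are made up for the example. It only mirrors the title.startswith(text) comparison mentioned above and is not taken from the pymupdf4llm source.

```python
# Illustrative only: shortened strings, hypothetical variable names.
toc_title = "\ufeffEffects of open-label placebos across populations and outcomes"
page_text = "Effects of open-label placebos"

# The leading U+FEFF is invisible when printed, but it is still the first
# character of the string, so the prefix comparison fails.
with_bom = toc_title.startswith(page_text)

# Once the BOM is stripped from the front, the same comparison succeeds.
without_bom = toc_title.lstrip("\ufeff").startswith(page_text)

print(with_bom, without_bom)
```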

As a quick fix, you could just strip the BOM in get_header_id. Recognition also fails if there are non-breaking spaces (\xa0), so I suggest replacing them with regular spaces. See my pull request #309.
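
A rough sketch of the kind of normalisation I mean, assuming it is applied to the TOC title before the comparison. The helper name normalize_title is hypothetical and this is not the code from the pull request.

```python
def normalize_title(title: str) -> str:
    """Hypothetical helper: clean a TOC title before comparing it to page text."""
    # Drop a single leading byte order mark (U+FEFF) if present.
    if title.startswith("\ufeff"):
        title = title[1:]
    # Replace non-breaking spaces with regular spaces.
    return title.replace("\xa0", " ")


raw = "\ufeffEffects of open-label\xa0placebos"
clean = normalize_title(raw)
print(repr(clean))
```

Titles without a BOM or non-breaking spaces pass through unchanged, so applying this to every title should be safe.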

Greetings,
Paul

Jamie_Lemon · October 2, 2025, 1:07pm

Hi @soelderer - thanks for the post and I see the issue. Your PR seems reasonable enough to me - @HaraldLieder What do you think?

HaraldLieder · October 2, 2025, 5:40pm

I fully agree. I just hadn’t had the time to respond properly. The fix is harmless, as it does nothing if the string doesn’t start with a BOM. The same is true for replacing the non-breaking space with a normal one.
The next version will definitely include all of that.
