Pymupdf4llm: byte order marks in TOC prevent recognition of the document structure

Hey,

I’ve already posted it on GitHub, but then saw the recommendation to use the forum. I’m extracting this paper to markdown like this:

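import pymupdf
import pymupdf4llm

doc = pymupdf.open(absolute_path)  # absolute_path: path to the PDF file
headers = pymupdf4llm.TocHeaders(doc)  # derive header levels from the PDF's TOC
text = pymupdf4llm.to_markdown(doc, hdr_info=headers)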

I noticed that the TocHeaders titles start with a byte order mark (BOM, U+FEFF): '\ufeffEffects of open-label placebos across populations and outcomes: an updated systematic review and meta-analysis of randomized controlled trials'. This prevents the document structure from being recognised, because title.startswith(text) fails.
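To make the failure easier to see, here is a minimal sketch using plain strings only. The variable names and the shortened title are stand-ins for illustration, not values taken from the library:

```python
import unicodedata

# Shortened stand-ins for a TOC title and the matching text found on the page.
toc_title = "\ufeffEffects of open-label placebos"
page_text = "Effects of open-label placebos"

# The BOM is invisible when printed, but it is still the first character,
# so the prefix comparison fails.
print(toc_title.startswith(page_text))  # False

# repr() and unicodedata.name() reveal the hidden character.
print(repr(toc_title[:8]))
print(unicodedata.name(toc_title[0]))   # ZERO WIDTH NO-BREAK SPACE
```

Printing the title normally would not show the problem, which is why checking with repr() is useful here.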

As a quick fix, you could just strip the BOM in get_header_id. Recognition also fails if there are non-breaking spaces (\xa0), so I suggest replacing them with regular spaces. See my pull request #309.
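Roughly, the idea looks like this. This is only a sketch of the normalisation step, not the actual code from the pull request; normalize_title is a hypothetical helper name and the example strings are made up:

```python
def normalize_title(title: str) -> str:
    """Hypothetical helper: drop a leading BOM and replace NBSPs with spaces."""
    # Strip the byte order mark only if it is actually present.
    if title.startswith("\ufeff"):
        title = title[1:]
    # Replace non-breaking spaces with regular spaces.
    return title.replace("\xa0", " ")


# Made-up example: a TOC title with a BOM and one non-breaking space.
toc_title = "\ufeffEffects of\xa0open-label placebos across populations"
page_text = "Effects of open-label placebos"

print(toc_title.startswith(page_text))                   # False
print(normalize_title(toc_title).startswith(page_text))  # True
```

Strings without a BOM or non-breaking spaces pass through unchanged. If the page text can contain non-breaking spaces too, the same normalisation would have to be applied to both sides of the comparison.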

Greetings,
Paul

Hi @soelderer - thanks for the post and I see the issue. Your PR seems reasonable enough to me - @HaraldLieder What do you think?

I fully agree. I just hadn't had the time to respond properly. The fix is harmless, as it does nothing if the string doesn't start with a BOM. The same is true for replacing the non-breaking space with a normal one.
The next version will definitely include all of that.
