Hey,
I’ve already posted this on GitHub, but then saw the recommendation to use the forum instead. I’m extracting this paper to Markdown like this:
import pymupdf
import pymupdf4llm
doc = pymupdf.open(absolute_path)
headers = pymupdf4llm.TocHeaders(doc)
text = pymupdf4llm.to_markdown(doc, hdr_info=headers)
I noticed that the titles in TocHeaders start with a byte order mark (U+FEFF): '\ufeffEffects of open-label placebos across populations and outcomes: an updated systematic review and meta-analysis of randomized controlled trials'. This prevents the document structure from being recognised, because title.startswith(text) fails.
As a quick fix, you could just strip the BOM in get_header_id. Recognition also fails if there are non-breaking spaces (\xa0), so I suggest replacing them. See my pull request #309.
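To illustrate the idea, here is a minimal sketch of the kind of normalisation I mean. It is not the code from the pull request: normalize_title is a hypothetical helper name, the sample strings are shortened and made up for the demo, and replacing \xa0 with a plain space is my assumption about the sensible replacement.

```python
def normalize_title(s: str) -> str:
    """Hypothetical helper: drop BOM characters and turn
    non-breaking spaces into plain spaces (assumed replacement)."""
    return s.replace("\ufeff", "").replace("\xa0", " ").strip()


# Shortened, made-up sample: a TOC title with a leading BOM and one \xa0
toc_title = "\ufeffEffects of open-label\xa0placebos across populations and outcomes"
# Text as it might appear on the page
page_text = "Effects of open-label placebos"

# The raw comparison fails because of the leading BOM
print(toc_title.startswith(page_text))  # False

# After normalising both sides, the prefix match succeeds
print(normalize_title(toc_title).startswith(normalize_title(page_text)))  # True
```

Normalising both strings before the comparison keeps the check symmetric, so it should not matter whether the stray characters come from the TOC or from the page text.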
Greetings,
Paul