I’m trying to use PyMuPDF4LLM to generate Markdown.
The document I’m testing with has a TOC and multiple font sizes ([4.0, 8.0, 10.0, 11.0, 12.0, 14.0, 18.0, 20.0]).
The issue is that if I use pymupdf4llm.to_markdown(doc, header=False, footer=False), I end up with the document title from the cover page (20 pt) as H1 and all other “larger” text as H2. It turns 20, 18, 14, and 12 pt text into H2, even though the document TOC is three levels deep.
I realize that by default it uses “Layout” detection for headers, but I don’t understand why it’s not using 20 pt as H1, 18 as H2, and 14 as H3. That would be a huge improvement over the current output for this document.
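To make the mapping I expected concrete, here is a minimal sketch in plain Python. The function names and the "take the three largest sizes" rule are my own illustration, not anything PyMuPDF4LLM does internally, and I don't know whether the layout-based path would honor a custom mapping like this.

```python
# Hypothetical illustration of the size-to-level mapping I expected.
# The sizes are the ones found in my document; everything else is assumed.
FONT_SIZES = [4.0, 8.0, 10.0, 11.0, 12.0, 14.0, 18.0, 20.0]
TOC_DEPTH = 3  # the TOC is three levels deep


def build_header_map(sizes, depth):
    """Map the `depth` largest font sizes to Markdown heading prefixes."""
    largest = sorted(set(sizes), reverse=True)[:depth]
    return {size: "#" * (level + 1) + " " for level, size in enumerate(largest)}


HEADER_MAP = build_header_map(FONT_SIZES, TOC_DEPTH)


def header_prefix(span):
    """Return the heading prefix for a span dict with a "size" key.

    Body text (any size not in the map) gets an empty prefix.
    """
    return HEADER_MAP.get(round(span["size"]), "")
```

With this, a 20 pt span would get "# ", 18 pt "## ", 14 pt "### ", and 12 pt and below would stay body text.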
I’ve done a bunch of work to sanitize my PDF before passing to PyMuPDF4LLM to try and improve detection: I’ve stripped empty spans and merged adjacent spans with the same font properties.
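For reference, the sanitizing logic is roughly the following. This is a simplified sketch that works on a list of span dicts shaped like the ones page.get_text("dict") returns (keys "text", "font", "size", "flags", "color"). The helper names are hypothetical, and it leaves out how the cleaned spans get written back to the PDF.

```python
STYLE_KEYS = ("font", "size", "flags", "color")


def same_style(a, b):
    """True if two span dicts share the same font properties."""
    return all(a.get(key) == b.get(key) for key in STYLE_KEYS)


def clean_spans(spans):
    """Drop empty spans and merge adjacent spans with identical styling.

    Whitespace-only spans are kept so that word spacing survives the merge.
    The input list and its dicts are not modified.
    """
    merged = []
    for span in spans:
        if not span.get("text"):
            continue  # skip spans with no text at all
        if merged and same_style(merged[-1], span):
            merged[-1]["text"] += span["text"]
        else:
            merged.append(dict(span))  # copy so the caller's data is untouched
    return merged
```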
How can I get PyMuPDF4LLM to generate header levels properly? Right now the output is far from optimal, and I’m not finding many levers I can pull to hint at what should and should not be a header.
The document I’m trying to parse: https://cdn-cms-frontdoor-dfc8ebanh6bkb3hs.a02.azurefd.net/getmedia/4ca430cb-b295-4fdd-b003-d7af10fd5bc5/moxa-industrial-smart-ethernet-switch-users-manual-v3.1.pdf