Issues with TOC processing

I’m trying to use PyMuPDF4LLM to generate Markdown.
The document I’m testing with has a TOC and uses multiple font sizes ([4.0, 8.0, 10.0, 11.0, 12.0, 14.0, 18.0, 20.0]).

The issue is that if I use pymupdf4llm.to_markdown(doc, header=False, footer=False), I end up with the document title from the cover page (20 pt) as H1 and all other “larger” font text as H2. Text at 20, 18, 14, and 12 pt all becomes H2, even though the document TOC is three levels deep.

I realize that by default it uses “Layout” detection for headers, but I don’t understand why it’s not using 20 pt as H1, 18 as H2, and 14 as H3. That would be a huge improvement over the current output for this document.
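To make the expected mapping concrete, here is a minimal sketch in plain Python of the size-to-level assignment described above. The helper name `header_levels_by_font_size` and its parameters are hypothetical and not part of PyMuPDF4LLM; it also assumes that 12 pt is the body-text size, so only sizes strictly above it count as header candidates.

```python
def header_levels_by_font_size(sizes, body_limit=12.0, max_levels=3):
    """Map the largest font sizes above body_limit to Markdown header prefixes.

    Hypothetical helper for illustration only, not a PyMuPDF4LLM function.
    """
    # Keep distinct sizes strictly larger than the assumed body-text size,
    # largest first, so the biggest font becomes the top-level header.
    candidates = sorted({s for s in sizes if s > body_limit}, reverse=True)
    return {
        size: "#" * level + " "
        for level, size in enumerate(candidates[:max_levels], start=1)
    }


# Font sizes reported for the test document.
sizes = [4.0, 8.0, 10.0, 11.0, 12.0, 14.0, 18.0, 20.0]
levels = header_levels_by_font_size(sizes)

for size, prefix in levels.items():
    print(f"{size} pt -> {prefix.strip()}")
```

With these assumptions, 20 pt maps to "#", 18 pt to "##", and 14 pt to "###", which matches the three-level TOC; 12 pt and below are treated as body text.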

I’ve done a bunch of work to sanitize my PDF before passing to PyMuPDF4LLM to try and improve detection: I’ve stripped empty spans and merged adjacent spans with the same font properties.

How can I get PyMuPDF4LLM to properly generate Header Levels? Right now, this processing makes the output very sub-optimal and I’m not finding many levers I can pull to hint what should be headers and what should not be.

The document I’m trying to parse: https://cdn-cms-frontdoor-dfc8ebanh6bkb3hs.a02.azurefd.net/getmedia/4ca430cb-b295-4fdd-b003-d7af10fd5bc5/moxa-industrial-smart-ethernet-switch-users-manual-v3.1.pdf

Welcome @Luciano_Moretti !

It may indeed be a little confusing:
If you use the default pymupdf4llm config, then the Layout module is active.
In that case there is no header level detection and also no way to use the TOC for this.

The reason is that the Layout module uses an AI model (a GNN) and can only differentiate between “title” (level 1 = “#”) and “section-header” (level 2 = “##”). That’s it.

You must deactivate layout if you want to use multiple header levels (derived from font sizes) or TOC.
Of course, you will then lose the convenience of layout detection as well.
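As a rough sketch of what this could look like once layout is deactivated: the `hdr_info` parameter of `to_markdown` accepts a header-detection object such as `pymupdf4llm.TocHeaders` (levels taken from the TOC) or `pymupdf4llm.IdentifyHeaders` (levels derived from font sizes). The wrapper function name and its `use_toc` flag are hypothetical, and how layout gets deactivated depends on the installed version, so that step is deliberately left out here. Please check the documentation for your version before relying on this.

```python
def to_markdown_with_header_levels(pdf_path, use_toc=True, max_levels=3):
    """Convert a PDF to Markdown with multi-level headers.

    Hypothetical wrapper. Assumes the Layout module is already
    deactivated, since per the answer above header levels from
    the TOC or from font sizes are not used in layout mode.
    """
    # Imported lazily so the sketch can be loaded without the packages.
    import pymupdf
    import pymupdf4llm

    doc = pymupdf.open(pdf_path)
    if use_toc:
        # Header levels follow the document's table of contents.
        hdr_info = pymupdf4llm.TocHeaders(doc)
    else:
        # Header levels are derived from the most frequent larger font sizes.
        hdr_info = pymupdf4llm.IdentifyHeaders(doc, max_levels=max_levels)
    return pymupdf4llm.to_markdown(doc, hdr_info=hdr_info)
```

For a document whose TOC is three levels deep, the TOC-based variant is likely the better fit; the font-size variant with `max_levels=3` is a fallback for documents without a usable TOC.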