Issues with TOC processing

Luciano_Moretti · June 15, 2026, 4:59pm

I’m trying to use PyMuPDF4LLM to generate Markdown.
The document I’m testing with has a TOC and has multiple font sizes ([4.0, 8.0, 10.0, 11.0, 12.0, 14.0, 18.0, 20.0])

The issue is that if I use pymupdf4llm.to_markdown(doc, header=False, footer=False), I end up with the document title from the cover page (20 pt) as H1 and all other “larger” font text as H2. It turns 20, 18, 14, and 12 pt into H2, even though the document TOC is three levels deep.

I realize that by default it uses “Layout” detection for headers, but I don’t understand why it’s not using 20 pt as H1, 18 as H2, and 14 as H3. That would be a huge improvement over the current output for this document.

I’ve done a bunch of work to sanitize my PDF before passing to PyMuPDF4LLM to try and improve detection: I’ve stripped empty spans and merged adjacent spans with the same font properties.

How can I get PyMuPDF4LLM to properly generate Header Levels? Right now, this processing makes the output very sub-optimal and I’m not finding many levers I can pull to hint what should be headers and what should not be.

The document I’m trying to parse: https://cdn-cms-frontdoor-dfc8ebanh6bkb3hs.a02.azurefd.net/getmedia/4ca430cb-b295-4fdd-b003-d7af10fd5bc5/moxa-industrial-smart-ethernet-switch-users-manual-v3.1.pdf

HaraldLieder · June 15, 2026, 5:16pm

Welcome @Luciano_Moretti !

It may indeed be a little confusing:
If you use the default pymupdf4llm config, then the Layout module is active.
In that case there is no header level detection and also no way to use the TOC for this.

The reason is that the Layout module uses an AI model (a GNN) that can only differentiate between “title” (level 1 = “#”) and “section-header” (level 2 = “##”). That’s it.

You must deactivate layout if you want to use multiple header levels (derived from font sizes) or TOC.
Of course, you will then lose the convenience of layout detection as well.
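
To illustrate what "header levels derived from font sizes" could look like for the sizes listed in the question, here is a minimal sketch in plain Python. The helper names `build_header_map` and `make_hdr_info` are hypothetical, and the body-text threshold of 12 pt is an assumption chosen so that 20, 18, and 14 pt become H1, H2, and H3. It is not the library's internal logic.

```python
def build_header_map(font_sizes, body_size, max_levels=3):
    """Map font sizes larger than body_size to Markdown header prefixes.

    The largest size becomes "# ", the next "## ", and so on, up to
    max_levels. Sizes are rounded so that 19.98 and 20.0 count as the same.
    """
    larger = sorted({round(s) for s in font_sizes if s > body_size}, reverse=True)
    return {
        size: "#" * level + " "
        for level, size in enumerate(larger[:max_levels], start=1)
    }


def make_hdr_info(header_map):
    """Wrap the map in a callable that takes a text span dict.

    Assumption: the span dict carries its font size under the key "size",
    as PyMuPDF text spans do.
    """
    def hdr_info(span, page=None):
        # Return the header prefix for this span, or "" for body text.
        return header_map.get(round(span["size"]), "")

    return hdr_info


# Font sizes reported for the test document in the question.
sizes = [4.0, 8.0, 10.0, 11.0, 12.0, 14.0, 18.0, 20.0]

# Assumption: everything up to and including 12 pt is treated as body text.
header_map = build_header_map(sizes, body_size=12.0, max_levels=3)
hdr_info = make_hdr_info(header_map)
```

With Layout deactivated, a callable of this shape should be usable as the `hdr_info` argument of `pymupdf4llm.to_markdown`. The package also provides `IdentifyHeaders` (font-size based) and `TocHeaders` (TOC based) for the same purpose. Check the documentation of your installed version for the exact parameters.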
