Better line structure in earlier versions, what happened?

lasse · September 12, 2025, 10:07am

When extracting text from PDFs using pymupdf4llm.to_markdown(), version 0.0.17 preserves line breaks and structure much closer to the original PDF.
In version 0.0.27, the output merges or alters lines, making it harder to parse tables of contents and similar structured text.
For my use case, extracting a ToC from a scanned PDF, 0.0.17 produces more accurate results.
I haven’t checked the versions in between the two, but I can do that if this is something you want to look into.

(I posted this as an issue in the repo before seeing the post about posting here, sorry for double posting…)

Jamie_Lemon · September 12, 2025, 12:18pm

Hi @lasse and welcome to the forum!

Is there any way you could share your PDF so we can take a closer look at the markdown results?

lasse · September 12, 2025, 12:49pm

It’s a document from a police investigation, so I don’t want to upload it here. I could send it to you in a DM, or else find another document where the same problem appears.

Jamie_Lemon · September 12, 2025, 12:51pm

Oh, definitely don’t post it if it is confidential! If it is easy to replicate in another file, then that would be easiest, I think.
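
In case it helps with replicating this on a non-confidential file, here is a rough sketch of one way to compare the output of two versions. It assumes each version is installed in its own environment (for example with `pip install pymupdf4llm==0.0.17`). The helper names `dump_markdown` and `diff_markdown` and the `md_out` folder are made up for illustration; only `pymupdf4llm.to_markdown()` comes from the library.

```python
import difflib
from importlib import metadata
from pathlib import Path


def dump_markdown(pdf_path: str, out_dir: str = "md_out") -> Path:
    """Run to_markdown() with whichever pymupdf4llm is installed and save the result.

    Hypothetical helper: the output file is named after the installed version,
    so runs from different environments do not overwrite each other.
    """
    import pymupdf4llm  # imported lazily so diff_markdown() works without it

    version = metadata.version("pymupdf4llm")
    out_path = Path(out_dir) / f"{Path(pdf_path).stem}-{version}.md"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(pymupdf4llm.to_markdown(pdf_path), encoding="utf-8")
    return out_path


def diff_markdown(old_md: str, new_md: str,
                  old_label: str = "0.0.17", new_label: str = "0.0.27") -> str:
    """Return a line-by-line unified diff of two markdown strings.

    Hypothetical helper: merged or reordered lines show up as -/+ pairs.
    """
    return "\n".join(
        difflib.unified_diff(
            old_md.splitlines(),
            new_md.splitlines(),
            fromfile=old_label,
            tofile=new_label,
            lineterm="",
        )
    )
```

Run `dump_markdown("sample.pdf")` once per environment, then pass the text of the two saved files to `diff_markdown()` and paste the diff here. That should show where lines get merged without anyone needing the original document.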
