Better line structure in earlier versions, what happened?

When extracting text from PDFs using pymupdf4llm.to_markdown(), version 0.0.17 preserves line breaks and structure much closer to the original PDF.
In version 0.0.27, the output merges or alters lines, making it harder to parse tables of contents and similar structured text.
For my use case, extracting a ToC from a scanned PDF, 0.0.17 produces more accurate results.
I haven’t checked the versions in between the two, but I can do that if this is something you want to look into.
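
For reference, here is a minimal sketch of how the two outputs could be compared. It assumes the Markdown from each version has already been produced separately (for example by running `pymupdf4llm.to_markdown()` in two environments) and is available as plain strings. The helper name `compare_markdown` and the sample strings are hypothetical and only illustrate the kind of line merging described above; they are not real output from either version.

```python
import difflib


def compare_markdown(old_md: str, new_md: str,
                     old_label: str = "0.0.17",
                     new_label: str = "0.0.27") -> dict:
    """Summarise how the line structure differs between two Markdown strings."""
    old_lines = old_md.splitlines()
    new_lines = new_md.splitlines()
    # Line-by-line unified diff; lineterm="" avoids stray newlines in the output.
    diff = list(difflib.unified_diff(old_lines, new_lines,
                                     fromfile=old_label, tofile=new_label,
                                     lineterm=""))
    return {
        "old_nonblank": sum(1 for line in old_lines if line.strip()),
        "new_nonblank": sum(1 for line in new_lines if line.strip()),
        "diff": diff,
    }


# Hypothetical sample data standing in for real to_markdown() output.
old_sample = "1 Introduction 3\n2 Method 7\n3 Results 12\n"
new_sample = "1 Introduction 3 2 Method 7\n3 Results 12\n"

summary = compare_markdown(old_sample, new_sample)
print(summary["old_nonblank"], "->", summary["new_nonblank"])
print("\n".join(summary["diff"]))
```

A drop in the non-blank line count, together with `+` lines that contain several ToC entries at once, would point to lines being merged. The same comparison could be repeated for the versions in between to narrow down where the behaviour changed.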

(I posted this as an issue in the repo before seeing the post about posting here, sorry for double-posting…)

Hi @lasse and welcome to the forum!

Is there any way you could share your PDF so we can take a closer look at the Markdown results?

It’s a document from a police investigation, so I don’t want to upload it here. I could send it to you in a DM, or else find another document where the same problem appears.

Oh, definitely don’t post it if it is confidential! If it is easy to replicate in another file then that would be easiest I think :slight_smile: