When extracting text from PDFs using pymupdf4llm.to_markdown(), version 0.0.17 preserves line breaks and structure much closer to the original PDF.
In version 0.0.27, the output merges or alters lines, making it harder to parse tables of contents and similar structured text.
For my use case, extracting a ToC from a scanned PDF, 0.0.17 produces more accurate results.
I haven’t checked the versions in between the two, but I can do that if this is something you want to look into.
(I posted this as an issue in the repo before seeing the post about posting here, sorry for double-posting…)
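In case it helps with reproducing this, below is a rough sketch of how the outputs of two versions could be compared. It assumes each version is installed in its own virtual environment. The helper names `dump_markdown` and `diff_outputs` and the output file naming are placeholders I made up for illustration; the only pymupdf4llm call used is `to_markdown()` with a file path.

```python
import difflib
from importlib.metadata import version
from pathlib import Path


def dump_markdown(pdf_path: str, out_dir: str = ".") -> Path:
    """Save the to_markdown() output to a file named after the installed version.

    Hypothetical helper: run it once per environment (e.g. one with 0.0.17
    installed and one with 0.0.27) against the same PDF.
    """
    # Imported lazily so diff_outputs() can be used without pymupdf4llm installed.
    import pymupdf4llm

    markdown = pymupdf4llm.to_markdown(pdf_path)
    out_file = Path(out_dir) / f"output_{version('pymupdf4llm')}.md"
    out_file.write_text(markdown, encoding="utf-8")
    return out_file


def diff_outputs(old_md: str, new_md: str,
                 old_label: str = "0.0.17", new_label: str = "0.0.27") -> str:
    """Return a unified diff of two Markdown strings, compared line by line."""
    diff_lines = difflib.unified_diff(
        old_md.splitlines(),
        new_md.splitlines(),
        fromfile=old_label,
        tofile=new_label,
        lineterm="",
    )
    return "\n".join(diff_lines)
```

Calling `dump_markdown("my.pdf")` in each environment and then passing the contents of the two saved files to `diff_outputs()` should show where lines were merged or changed. The same approach would work for checking the versions in between.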