Hello, and thank you for the project! I am using pymupdf4llm to convert PDFs to Markdown. For the documents I work with, conversion quality was best in v0.0.17. I’ve been tracking the issues others have opened since then (for example, GitHub issues #261 and #289), and neither the advice given there nor the subsequent releases have fixed the problems I’ve seen. I also saw the thread "Better line structure in earlier versions, what happened?" and figured I could help by providing a specific example of the biggest blocker I have in upgrading.
With this PDF (all of the info inside is fake) as input: input.pdf (81.5 KB)
After upgrading to v0.0.18, some paragraphs move from the middle or bottom of the page to the top of the page in the Markdown output.
The conversion script follows the examples in the documentation:
```python
import pathlib
import pymupdf4llm
md_text = pymupdf4llm.to_markdown("input.pdf")
pathlib.Path("output.md").write_bytes(md_text.encode())
```
I originally planned to include output files for v0.0.17, v0.0.18, and the latest v0.0.27, but I am limited to 2 links.
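In case it helps anyone reproduce the comparison locally, here is a minimal sketch of how the two outputs can be diffed with the standard library. It assumes the script above was run once per installed version and each result was saved under its own name; the file names `output_v0.0.17.md` and `output_v0.0.18.md` and the helper `diff_markdown` are placeholders of mine, not anything from pymupdf4llm.

```python
import difflib
import pathlib


def diff_markdown(old_text: str, new_text: str,
                  old_label: str = "old", new_label: str = "new") -> str:
    """Return a unified diff between two Markdown strings (empty if identical)."""
    diff = difflib.unified_diff(
        old_text.splitlines(keepends=True),
        new_text.splitlines(keepends=True),
        fromfile=old_label,
        tofile=new_label,
    )
    return "".join(diff)


if __name__ == "__main__":
    # Hypothetical file names: one output saved per installed version.
    old_path = pathlib.Path("output_v0.0.17.md")
    new_path = pathlib.Path("output_v0.0.18.md")
    if old_path.exists() and new_path.exists():
        print(diff_markdown(
            old_path.read_text(encoding="utf-8"),
            new_path.read_text(encoding="utf-8"),
            old_path.name,
            new_path.name,
        ))
```

A relocated paragraph shows up as a removed block in one place and the same block added somewhere else, which is the pattern I see between the two versions.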
Here is a screenshot of the git diff between the v0.0.17 and v0.0.18 outputs, showing the paragraphs that were relocated:
Hope this helps!