Are you aware of this kind of issue, and is a fix planned? Do you plan to train it on more data? Are there any alternatives that could work for my case?
Hi @alex, welcome to the forum! Yes, we are aware of anomalies like this and aim to release new versions with more training data in the future. If possible, could you share your PDF here? That way I can try to replicate the problem and log it on our issue board.
Furthermore, you may be interested in our new Discord server: PyMuPDF4LLM where you can get more news as it happens and find out more about the product evolution and general AI topics.
Hi @alex, sorry for the late reply. I’ve just tried this with the latest versions of PyMuPDF & PyMuPDF-layout (1.27.1), released yesterday, using the attached script:
python test-4llm.py lmh1239.pdf
Note, I only looked at two pages in there as it looked like they were tables with captions above and below. The MD result looked good to me.
Can you verify this and, if possible, tell me which pages exhibit the problem so I can parse them?
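For readers without the attachment, here is a minimal sketch of what a script like this could look like. It is not the attached test-4llm.py. The helper names `parse_pages` and `convert` are hypothetical, and the sketch assumes that importing `pymupdf.layout` before `pymupdf4llm` switches on the layout-based analysis, as the PyMuPDF Layout documentation describes.

```python
def parse_pages(spec):
    """Turn a 1-based page spec such as "1,3-4" into 0-based page indices."""
    pages = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-", 1)
            pages.extend(range(int(first) - 1, int(last)))
        else:
            pages.append(int(part) - 1)
    return pages


def convert(pdf_path, pages=None):
    """Return Markdown for the given pages (all pages when pages is None)."""
    # Imported lazily so parse_pages() works without the packages installed.
    # Assumption: pymupdf.layout must be imported before pymupdf4llm.
    import pymupdf.layout  # noqa: F401
    import pymupdf4llm

    return pymupdf4llm.to_markdown(pdf_path, pages=pages)


# Example (needs the packages and the PDF):
# print(convert("lmh1239.pdf", parse_pages("1-2")))
```

Limiting the run to the pages that show the problem keeps the Markdown output short enough to compare by eye.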
Thanks for replying.
I’ve updated PyMuPDF & PyMuPDF-layout to the latest versions.
The detection seems more accurate for tables and pictures. Could you tell me what has changed?
I see nothing regarding pymupdf-layout in the change.txt file.
However, I still notice some issues:
In the lmh1239.pdf file, PyMuPDF seems to see a single column on the first page, but as you can see, there are two columns. The MD file shows that the first page was treated as a single-column page. The PyMuPDF layout demo shows the problem:
For my project, I’ll have to feed an LLM with this kind of file, and I need the output to be very accurate to avoid hallucination. But overall, I think PyMuPDF is the best and fastest PDF-to-Markdown converter I’ve tested.
Can you confirm the version of Pymupdf-layout that you are using?
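One quick way to answer this is to print the installed versions from Python. This is a small sketch using only the standard library. The distribution names below (`pymupdf`, `pymupdf4llm`, `pymupdf-layout`) are assumed to match what pip installed, and `installed_versions` is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("pymupdf", "pymupdf4llm", "pymupdf-layout")):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions().items():
    print(f"{name}: {ver or 'not installed'}")
```

Pasting this output into a reply makes it easy to check that both sides are really running the same releases.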
I will look into the column issue. Essentially the plan to improve the tool is to achieve better model training and this is something we are steadily working towards.
Same setup, same module version, same file, same Python script, no OCR, but two different results. I’m very surprised at this.
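When two runs disagree like this, a line-by-line diff of the two Markdown outputs can show exactly where they diverge. This is a minimal sketch using the standard library. `markdown_diff` and the file labels are hypothetical names, and it assumes each run's output was saved as text.

```python
import difflib


def markdown_diff(a_text, b_text, a_name="run_a.md", b_name="run_b.md"):
    """Return unified-diff lines between two Markdown outputs."""
    return list(
        difflib.unified_diff(
            a_text.splitlines(),
            b_text.splitlines(),
            fromfile=a_name,
            tofile=b_name,
            lineterm="",
        )
    )


# Example with two small table fragments that differ in one cell.
mine = "| a | b |\n| 1 | 2 |"
yours = "| a | b |\n| 1 | 5 |"
print("\n".join(markdown_diff(mine, yours)))
```

An empty result means the two outputs are identical, so any remaining difference lies elsewhere in the environment.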
Actually, I get the same results when I copy and paste the characters in the table directly from the PDF. That’s why I’m very surprised that you didn’t use OCR…
Aha! Thanks @alex, I knew we must have been missing something. With Tesseract available, it will OCR when certain criteria are met, and you can also force OCR on pages if you want to. Also, with the latest release, you can use an alternative OCR engine should you wish to. See PyMuPDF Layout - PyMuPDF documentation for more.
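Since the difference came down to whether OCR ran, it can help to check whether Tesseract is visible on each machine. This is a rough sketch with the standard library only. It assumes the Tesseract executable is named `tesseract` and that the language data location is given by the `TESSDATA_PREFIX` environment variable; `ocr_environment` is a hypothetical helper name, and this does not show how the library itself decides when to OCR.

```python
import os
import shutil


def ocr_environment():
    """Report where Tesseract and its language data appear to live, if anywhere."""
    return {
        # Full path of the executable on PATH, or None if it is not found.
        "tesseract_path": shutil.which("tesseract"),
        # Directory holding the language data, or None if the variable is unset.
        "tessdata_prefix": os.environ.get("TESSDATA_PREFIX"),
    }


for key, value in ocr_environment().items():
    print(f"{key}: {value or 'not found'}")
```

If one machine reports a path and the other reports nothing, that would be consistent with only one of them applying OCR.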
Finally (for now) - if you are a Discord user please join our PyMuPDF4LLM server here: PyMuPDF4LLM - that is one of the best ways to stay in touch with the realtime development.