PyMuPDF Layout table detection issue

Hello,

I just found out about PyMuPDF Layout. Until now I was using pymupdf4llm, but adding PyMuPDF Layout to it seems relevant.

It would be very useful for my commercial project, because I need accurate PDF-to-Markdown extraction, especially for tables.

But I see some limitations.

For example, here, the caption under the table is detected as part of the table:

image

Here, there are two tables, each with its own caption, but they are detected as a single table:

Are you aware of this kind of issue, and is a fix planned? Do you plan to train the model on more data? Are there any alternatives that could work for my case?

Hi @alex, welcome to the forum! Yes, we are aware of anomalies like this and aim to release new versions with more training data in the future. If possible, could you share your PDF here? That way I can try to replicate the problem and log it on our issue board.
Furthermore, you may be interested in our new Discord server: PyMuPDF4LLM where you can get more news as it happens and find out more about the product evolution and general AI topics.

lmh1239.pdf (2.5 MB)

Infineon_03_25_2025_DS_EZ_USB_FX20-3575870.pdf (722.4 KB)

Thank you for your quick answer. Both of these documents have the kind of issues I described.

Hi @alex, sorry for the late reply. I’ve just been trying with the latest versions of PyMuPDF and PyMuPDF-layout (1.27.1), released yesterday, using the attached script, run as follows:

python test-4llm.py lmh1239.pdf

Note that I only looked at two pages, as they appeared to contain tables with captions above and below. The Markdown result looked good to me.

Can you verify this and, if possible, tell me which pages exhibit the problem?

test-4llm.py (479 Bytes)

lmh1229.md (119.7 KB)

lmh1239.pdf (2.5 MB)

Hi @Jamie_Lemon ,

Thanks for replying.
I’ve updated PyMuPDF and PyMuPDF-layout to the latest versions.
The detection seems more accurate for tables and pictures. Could you tell me what has changed?

I see nothing regarding pymupdf-layout in the change.txt file.

However, I still notice some issues:

  • In the lmh1239.pdf file, pymupdf seems to see a single column on the first page, but as you can see, there are two columns. The md file shows that the first page was treated as a single-column page. The PyMuPDF Layout demo shows the problem:

For my project, I’ll have to feed an LLM with this kind of file, and the extraction needs to be very accurate to avoid hallucinations. But overall, I think pymupdf is the best and fastest PDF-to-Markdown converter I’ve tested.

(I split the previous message in two because I can only link two files per message.)
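On the column issue above, a rough way to sanity-check a page independently of the layout model is to look at the horizontal extents of its text blocks. The helper below is only a sketch: `count_text_columns`, its thresholds, and the sample coordinates are hypothetical and are not part of PyMuPDF or PyMuPDF Layout. It assumes each block is a tuple starting with (x0, y0, x1, y1) and carrying the block type at index 6, which is the shape returned by `page.get_text("blocks")`.

```python
def count_text_columns(blocks, page_width, max_width_ratio=0.6, min_gap=10.0):
    """Estimate how many text columns a page has from block bounding boxes.

    Hypothetical heuristic, not the method used by PyMuPDF Layout.
    """
    # Keep text blocks only (type 0); shorter tuples are assumed to be text.
    text_blocks = [b for b in blocks if len(b) < 7 or b[6] == 0]
    if not text_blocks:
        return 0

    # Drop blocks spanning most of the page width (titles, wide tables),
    # since they would bridge the gutter between columns.
    spans = sorted(
        (b[0], b[2])
        for b in text_blocks
        if (b[2] - b[0]) <= max_width_ratio * page_width
    )
    if not spans:
        return 1  # only full-width text: treat as a single column

    # Sweep left to right; a horizontal gap of at least min_gap between
    # the running right edge and the next block starts a new column.
    columns = 1
    right_edge = spans[0][1]
    for x0, x1 in spans[1:]:
        if x0 - right_edge >= min_gap:
            columns += 1
            right_edge = x1
        else:
            right_edge = max(right_edge, x1)
    return columns


# Made-up coordinates for a 600 pt wide page with a full-width title.
sample = [
    (50, 20, 550, 60, "Title", 0, 0),
    (50, 100, 290, 200, "left 1", 1, 0),
    (50, 210, 280, 300, "left 2", 2, 0),
    (310, 100, 550, 200, "right 1", 3, 0),
]
print(count_text_columns(sample, page_width=600))  # 2
```

With PyMuPDF installed, the same function could be fed real data with `page = pymupdf.open("lmh1239.pdf")[0]` followed by `count_text_columns(page.get_text("blocks"), page.rect.width)`. If this reports two columns while the Markdown reads as one, that is useful evidence to attach to the issue report.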

@alex

I’ve been testing the pages you identified in the document. Here are my findings with the following code:


import sys
from importlib.metadata import version
from pathlib import Path

import pymupdf.layout  # imported before pymupdf4llm to activate layout analysis
import pymupdf4llm

print(f"{pymupdf.version=}, {pymupdf4llm.version=}")

# Report the installed pymupdf-layout version, if it can be determined.
try:
    layout_version_info = f"pymupdf-layout version={version('pymupdf-layout')}"
    print(layout_version_info)
except Exception as e:
    print(f"Could not determine version: {e}")

doc = pymupdf.open(sys.argv[1])
md = pymupdf4llm.to_markdown(
    doc,
    show_progress=True,
    pages=[31, 80, 81],  # 0-based, i.e. pages 32, 81 and 82 of the PDF
    header=False,  # include/omit page headers
    footer=False,  # include/omit page footers
    page_separators=True,
)

# Write the Markdown next to the input file, replacing .pdf with .md.
suffix = ".md"
Path(doc.name).with_suffix(suffix).write_bytes(md.encode())

Note that my PyMuPDF versions are as follows:

pymupdf.version=('1.27.1', '1.27.1', None), pymupdf4llm.version='0.3.4'

pymupdf-layout version=1.27.1

I thought that:

  • page 32 - the table extraction was pretty faithful
  • page 81 - agreed, the diagram is extracted as if it were a table (I guess this is because its grid-line structure resembles a table)
  • page 82 - the table extraction was pretty faithful

I attach my MD as well.

infineon-cyusb402x-ez-usb-fx20-usb-20-gbps-peripheral-controller-datasheet-en.md (6.8 KB)

Hi,

I’m using the same version of pymupdf:

$ .venv/Scripts/python test-4llm.py lmh1239.pdf
pymupdf.version=('1.27.1', '1.27.1', None), pymupdf4llm.version='0.3.4'

How is it possible that we get different results while using the same code?

infineon-cyusb402x-ez-usb-fx20-usb-20-gbps-peripheral-controller-datasheet-en.md (9.4 KB)

I end up with many wrong characters in the first table (edit: I guess you used OCR).

Also, have you looked at my message from last week about the column issue?

Could you tell me how you plan to improve this tool in the future?

Can you confirm the version of PyMuPDF-layout that you are using?

I will look into the column issue. Essentially the plan to improve the tool is to achieve better model training and this is something we are steadily working towards.

I’m using pymupdf-layout version 1.27.1.
Could you confirm that you’re using OCR?

I’m not forcing OCR.

Same setup, same module versions, same file, same Python script, no OCR, but two different results. I’m very surprised at this.

Actually, I get the same wrong characters when I copy and paste the text of the table directly from the PDF. That’s why I’m very surprised that you didn’t use OCR…
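When two machines appear identical but produce different output, it can help for both sides to print a fuller environment report and compare it line by line. The following is a minimal sketch using only the standard library; the function name `environment_report` and the default package list are assumptions chosen for this thread, not something shipped with PyMuPDF.

```python
import platform
import sys
from importlib.metadata import PackageNotFoundError, version


def environment_report(packages=("pymupdf", "pymupdf4llm", "pymupdf-layout")):
    """Collect interpreter, OS and package versions into a dict."""
    report = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            report[name] = version(name)
        except PackageNotFoundError:
            # The distribution is not installed in this environment.
            report[name] = "not installed"
    return report


for key, value in environment_report().items():
    print(f"{key}: {value}")
```

This only covers Python-level packages. Differences in external tools installed on the system would not show up here and need to be checked separately.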

@alex I will file this issue on our backlog. It could be that our trained model needs to be updated further.

Hi @Jamie_Lemon

I installed Tesseract and was then able to get the correct characters from the table.

Aha! Thanks @alex, I knew we must have been missing something. With Tesseract installed, it will OCR when certain criteria are met, and you can also force OCR on pages if you want to. Also, with the latest release, you can use an alternative OCR engine should you wish to - see: PyMuPDF Layout - PyMuPDF documentation for more.
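Since the difference in this thread came down to whether Tesseract was installed, a quick check before comparing results could look like the sketch below. It uses only the standard library. The helper name `tesseract_status` is hypothetical, and it is an assumption that finding the `tesseract` executable on PATH (and, where set, the `TESSDATA_PREFIX` environment variable) reflects what the library itself looks for when deciding whether OCR is available.

```python
import os
import shutil


def tesseract_status():
    """Describe whether Tesseract appears to be installed on this machine."""
    return {
        # Full path of the tesseract executable, or None if not on PATH.
        "executable": shutil.which("tesseract"),
        # Location of the language data files, if the variable is set.
        "tessdata_prefix": os.environ.get("TESSDATA_PREFIX"),
    }


status = tesseract_status()
if status["executable"] is None:
    print("tesseract not found on PATH; OCR is probably unavailable")
else:
    print(f"tesseract found at {status['executable']}")
print(f"TESSDATA_PREFIX={status['tessdata_prefix']!r}")
```

Including this output alongside the version numbers in a bug report makes it easier to tell whether two runs differ only in OCR availability.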
Finally (for now) - if you are a Discord user please join our PyMuPDF4LLM server here: PyMuPDF4LLM - that is one of the best ways to stay in touch with the realtime development.