Pymupdf layout table detection issue

alex · February 6, 2026, 10:36pm

Hello,

I just found out about pymupdf layout. Until now I was using pymupdf4llm, but adding pymupdf layout to it seems relevant.

It would be very interesting for my commercial project because I need accurate PDF-to-Markdown extraction, especially for tables.

But I see some limitations.

For example, here, the caption under the table is detected as part of the table:

Here, there are two tables and captions but it’s detected as a single table:

Are you aware of this kind of issue, and is a fix planned? Do you plan to train it on more data? Are there any alternatives that could work for my case?
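
As a stopgap for the first case, a caption that ends up as the last row of a Markdown table can sometimes be peeled off in post-processing. The sketch below is only a heuristic and not part of PyMuPDF: `split_trailing_caption` and `CAPTION_RE` are hypothetical names, and it assumes the caption lands in a final row whose only non-empty cell starts with "Table N" or "Figure N", which may not match what the extractor actually emits.

```python
import re
from typing import Optional, Tuple

# Hypothetical pattern: captions are assumed to start with "Table 3", "Figure 2", "Fig. 4", ...
CAPTION_RE = re.compile(r"^(Table|Figure|Fig\.)\s+\d+", re.IGNORECASE)


def split_trailing_caption(table_md: str) -> Tuple[str, Optional[str]]:
    """Split a caption-like last row off a Markdown table.

    Returns (table_without_caption, caption), or (table_md, None) if the
    last row does not look like a caption.
    """
    lines = table_md.rstrip("\n").split("\n")
    # Break the last row into cells, ignoring the outer pipes.
    cells = [c.strip() for c in lines[-1].strip().strip("|").split("|")]
    filled = [c for c in cells if c]
    # Need at least header + separator + caption row, and exactly one filled cell.
    if len(lines) > 2 and len(filled) == 1 and CAPTION_RE.match(filled[0]):
        return "\n".join(lines[:-1]), filled[0]
    return table_md, None
```

This would be applied to each table block found in the Markdown output. It does nothing for the second case (two tables merged into one), which needs a fix in the detection itself.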

Jamie_Lemon · February 6, 2026, 11:49pm

Hi @alex, welcome to the forum! Yes, we are aware of anomalies like this and aim to release new versions with more training data in the future. If possible, could you share your PDF here? That way I can try to replicate the problem and log your issue on our issue board.
Furthermore, you may be interested in our new Discord server: PyMuPDF4LLM where you can get more news as it happens and find out more about the product evolution and general AI topics.

alex · February 7, 2026, 10:17pm

lmh1239.pdf (2.5 MB)

Infineon_03_25_2025_DS_EZ_USB_FX20-3575870.pdf (722.4 KB)

Thank you for your quick answer. Both of these documents have the kind of issues I described.

Jamie_Lemon · February 12, 2026, 5:26pm

Hi @alex, sorry for the late reply. I've just tried the latest versions of PyMuPDF & PyMuPDF-layout (1.27.1), released yesterday, with the attached script:

python test-4llm.py lmh1239.pdf

Note, I only looked at two pages in there as it looked like they were tables with captions above and below. The MD result looked good to me.

Can you verify, and if possible tell me which pages exhibit the problem?

test-4llm.py (479 Bytes)

alex · February 13, 2026, 2:08pm

lmh1229.md (119.7 KB)

lmh1239.pdf (2.5 MB)

Hi @Jamie_Lemon ,

Thanks for replying.
I've updated PyMuPDF & PyMuPDF-layout to the latest versions.
The detection seems more accurate regarding tables and pictures. Could you tell me what has changed?

I see nothing regarding pymupdf-layout in the change.txt file.

However, I still notice some issues:

In the lmh1239.pdf file, it seems that pymupdf sees a single column on the first page, but as you can see, there are two columns. The md file shows that the first page was treated as a single-column page. The pymupdf layout demo shows the problem:

[image: screenshot from the pymupdf layout demo, 593×649, 118 KB]

For my project, I'll have to feed an LLM with this kind of file, and I need the extraction to be very accurate to avoid hallucinations. But overall, I think pymupdf is the best and the fastest PDF-to-Markdown converter I've tested.
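
A quick way to sanity-check the column count independently of the layout model is to look at the x-extents of the raw text blocks. The sketch below is a rough heuristic under stated assumptions: `estimate_columns`, the 60% width cutoff and the `min_gap` default are all made up for illustration, and it only counts vertical gaps between narrow blocks. It is not how pymupdf layout decides on columns.

```python
from typing import Iterable, Sequence


def estimate_columns(
    blocks: Iterable[Sequence],
    page_width: float,
    min_gap: float = 12.0,
) -> int:
    """Estimate the number of text columns from block bounding boxes.

    Each block starts with (x0, y0, x1, y1, ...). Blocks wider than 60% of
    the page (titles, full-width tables) are ignored.
    """
    spans = sorted(
        (b[0], b[2]) for b in blocks if (b[2] - b[0]) < 0.6 * page_width
    )
    if not spans:
        return 1
    columns = 1
    right_edge = spans[0][1]
    for x0, x1 in spans[1:]:
        # A horizontal gap with no block covering it starts a new column.
        if x0 - right_edge >= min_gap:
            columns += 1
        right_edge = max(right_edge, x1)
    return columns


# With PyMuPDF installed, the blocks can come straight from a page:
#   import pymupdf
#   page = pymupdf.open("lmh1239.pdf")[0]
#   print(estimate_columns(page.get_text("blocks"), page.rect.width))
```

If this reports two columns on a page that the Markdown output treats as one, that is a useful detail to include in a bug report.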

alex · February 13, 2026, 2:09pm

(I split the previous message in two because I can only link two files.)

In the file infineon-cyusb402x-ez-usb-fx20-usb-20-gbps-peripheral-controller-datasheet-en.pdf, I notice that the structure of page 32 is not detected (please check the md file of this document):

infineon-cyusb402x-ez-usb-fx20-usb-20-gbps-peripheral-controller-datasheet-en.md (163.4 KB)

[image, 1106×773, 155 KB]

In the same file, on page 82, the table structure doesn't seem to be recognized.

Here, on page 81, the picture was treated as a table:

[image, 547×766, 91.3 KB]

Jamie_Lemon · February 17, 2026, 4:15pm

@alex

I've tested the pages you identified in the document. Here are my findings, using the following code:


import sys
import pymupdf.layout
import pymupdf4llm

print(f"{pymupdf.version=}, {pymupdf4llm.version=}")

try:
  from importlib.metadata import version
  layout_version_info = f"pymupdf-layout version={version('pymupdf-layout')}"
  print(layout_version_info)
except Exception as e:
  print(f"Could not determine version: {e}")

doc = pymupdf.open(sys.argv[1])
md = pymupdf4llm.to_markdown(
  doc,
  show_progress=True,
  pages=[31, 80, 81],  # 0-based page numbers, i.e. pages 32, 81 and 82
  header=False,  # include/omit page headers
  footer=False,  # include/omit page footers
  page_separators=True
  )

from pathlib import Path
suffix = ".md"
Path(doc.name).with_suffix(suffix).write_bytes(md.encode())

Note, my Pymupdf versions are as follows:

pymupdf.version=('1.27.1', '1.27.1', None), pymupdf4llm.version='0.3.4'

pymupdf-layout version=1.27.1

I thought that:

page 32 - the table extraction was pretty faithful
page 81 - agreed, the diagram is being extracted as a table (I guess this is because its grid lines make it look like a table)
page 82 - the table extraction was pretty faithful

I attach my MD as well.

infineon-cyusb402x-ez-usb-fx20-usb-20-gbps-peripheral-controller-datasheet-en.md (6.8 KB)

alex · February 17, 2026, 8:37pm

Hi,

I'm using the same version of pymupdf:

$ .venv/Scripts/python test-4llm.py lmh1239.pdf
pymupdf.version=('1.27.1', '1.27.1', None), pymupdf4llm.version='0.3.4'

How is it possible that we get different results with the same code?

infineon-cyusb402x-ez-usb-fx20-usb-20-gbps-peripheral-controller-datasheet-en.md (9.4 KB)

I end up with many wrong characters in the first table (edit: I guess you used OCR).

Also, have you looked at my message from last week where I talk about the column issues?

Could you tell me how you plan to improve this tool in the future?

Jamie_Lemon · February 17, 2026, 9:54pm

Can you confirm the version of Pymupdf-layout that you are using?

I will look into the column issue. Essentially the plan to improve the tool is to achieve better model training and this is something we are steadily working towards.

alex · February 18, 2026, 9:01am

I’m using pymupdf-layout version 1.27.1.
Could you confirm that you’re using OCR?

Jamie_Lemon · February 18, 2026, 1:41pm

I'm not forcing OCR.

alex · February 18, 2026, 2:53pm

Same setup, same module versions, same file, same Python script, no OCR, but two different results... I'm very surprised by this.

Actually, I get the same results when I copy and paste the characters in the table directly from the PDF. That's why I'm very surprised that you didn't use OCR...
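
To pin down exactly where two Markdown outputs diverge, a plain line diff is usually enough. This is a minimal sketch using only the standard library; `diff_markdown` is a hypothetical helper name and the file names in the usage comment are placeholders.

```python
import difflib
from typing import List


def diff_markdown(
    a_text: str, b_text: str, a_name: str = "a.md", b_name: str = "b.md"
) -> List[str]:
    """Return a unified diff (as a list of lines) between two Markdown texts."""
    return list(
        difflib.unified_diff(
            a_text.splitlines(),
            b_text.splitlines(),
            fromfile=a_name,
            tofile=b_name,
            lineterm="",
        )
    )


# Usage with two exported files (placeholder names):
#   from pathlib import Path
#   mine = Path("mine.md").read_text(encoding="utf-8")
#   theirs = Path("theirs.md").read_text(encoding="utf-8")
#   print("\n".join(diff_markdown(mine, theirs, "mine.md", "theirs.md")))
```

If the differences are confined to the characters inside one table while the structure matches, that points at how the text was obtained rather than at the layout detection.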

Jamie_Lemon · February 24, 2026, 3:27pm

@alex I will file this issue on our backlog. It could be that our trained model needs to be updated further.

alex · February 24, 2026, 3:51pm

Hi @Jamie_Lemon

I installed Tesseract and I was able to get the characters from the table.

Jamie_Lemon · February 24, 2026, 3:56pm

Aha! Thanks @alex, I knew we must have been missing something. With Tesseract installed, it will OCR when certain criteria are met, and you can also force OCR on pages if you want to. Also, with the latest release, you can use an alternative OCR engine should you wish to. See: PyMuPDF Layout - PyMuPDF documentation for more.
Finally (for now), if you are a Discord user, please join our PyMuPDF4LLM server here: PyMuPDF4LLM. That is one of the best ways to stay in touch with development in real time.
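
Since the difference in this thread came down to whether Tesseract was installed, it can help to include that in any report alongside the package versions. The snippet below is a small sketch with a hypothetical helper, `tesseract_status`. It only checks whether a `tesseract` executable is on the PATH and whether `TESSDATA_PREFIX` is set. Which of these the library actually consults when deciding to OCR is an assumption here, so treat the result as a hint only.

```python
import os
import shutil
from typing import Callable, Mapping, Optional


def tesseract_status(
    env: Optional[Mapping[str, str]] = None,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> dict:
    """Report where (if anywhere) Tesseract appears to be installed."""
    env = os.environ if env is None else env
    return {
        # Full path of the executable, or None if it is not on the PATH.
        "tesseract_on_path": which("tesseract"),
        # Language data directory, or None if the variable is unset.
        "TESSDATA_PREFIX": env.get("TESSDATA_PREFIX"),
    }


if __name__ == "__main__":
    print(tesseract_status())
```

Two machines with identical package versions can still differ here, which would explain different Markdown for the same PDF and script.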
