Some drawings missing from pymupdf4llm output

Hello,

I’m using pymupdf4llm to extract PDF contents as Markdown. I’m observing that for one of my documents there are some drawings missing from the output. Specifically, Figures 4(a) and 4(b) on page 6 of my document are missing (PDF document: 2601.05047v3).

I’ve tried following the recommendations from pymupdf/pymupdf4llm Issue #296, "[Bug] A specific diagram recognized as significant is not extracted as images by pymupdf4llm.to_markdown". With page.get_drawings(), the drawings are detected correctly, as shown in the screenshot below. With page.cluster_drawings(), however, I only get back a single bbox covering the entire page.
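For anyone wanting to reproduce the comparison above, here is a minimal sketch, assuming a recent PyMuPDF that can be imported as pymupdf. The helper names covers_page and inspect_page and the 0.9 area threshold are my own illustrative choices, not part of PyMuPDF or pymupdf4llm. It counts the vector paths returned by page.get_drawings() and flags any cluster from page.cluster_drawings() that spans almost the whole page.

```python
def covers_page(bbox, page_rect, min_fraction=0.9):
    """Return True if bbox overlaps at least min_fraction of the page area.

    bbox and page_rect are (x0, y0, x1, y1) tuples. The 0.9 default is an
    arbitrary illustrative threshold, not a PyMuPDF setting.
    """
    x0, y0, x1, y1 = bbox
    px0, py0, px1, py1 = page_rect
    page_area = (px1 - px0) * (py1 - py0)
    if page_area <= 0:
        return False
    # Clip the bbox to the page before measuring the overlap.
    width = max(0.0, min(x1, px1) - max(x0, px0))
    height = max(0.0, min(y1, py1) - max(y0, py0))
    return (width * height) / page_area >= min_fraction


def inspect_page(pdf_path, page_index):
    """Compare raw drawing paths with clustered drawing boxes on one page.

    Hypothetical helper: page_index is zero-based, so page 6 is index 5.
    """
    import pymupdf  # imported lazily so covers_page works without PyMuPDF

    with pymupdf.open(pdf_path) as doc:
        page = doc[page_index]
        page_rect = tuple(page.rect)
        paths = page.get_drawings()  # list of dicts, one per vector path
        clusters = [tuple(r) for r in page.cluster_drawings()]
    return {
        "path_count": len(paths),
        "clusters": clusters,
        "page_sized": [c for c in clusters if covers_page(c, page_rect)],
    }
```

Calling inspect_page("./data/2601.05047v3.pdf", 5) would inspect page 6. A non-empty "page_sized" list alongside a large "path_count" would match the symptom described here: the paths are found, but clustering merges them into one page-wide box.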

Interestingly, Figures 2(a) and 2(b) on page 3, which appear very similar to Figures 4(a)-(b), are included in the output.

The code I am using:

from pathlib import Path

import pymupdf.layout  # must be imported before pymupdf4llm
import pymupdf4llm

input_pdf = Path("./data/2601.05047v3.pdf")
md = pymupdf4llm.to_markdown(input_pdf, embed_images=True)

with open("./data/output/2601.05047v3.md", "w", encoding="utf-8") as f:
    f.write(md)

One more observation: if I remove the pymupdf.layout import, the overall output is degraded, but the missing drawings are included.

Any assistance would be greatly appreciated!

Hi @rdelaney, can you confirm the versions of PyMuPDF, PyMuPDF Layout and PyMuPDF4LLM so I can try to replicate your case?


Ah, it looks like I was using pymupdf v1.26.6. Upgrading to v1.27.1 has resolved the issue for me. Thank you!
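In case it helps others hitting the same symptom, here is a small sketch for reporting the installed versions and comparing them against the version that worked in this thread. It uses only the standard library. The distribution names "pymupdf", "pymupdf-layout" and "pymupdf4llm" are assumed to match the PyPI package names, and installed_versions and version_tuple are hypothetical helper names. The 1.27.1 threshold is simply the version reported above, not a documented minimum.

```python
import re
from importlib import metadata


def version_tuple(version):
    """Turn a version string such as '1.27.1' into (1, 27, 1) for comparison."""
    return tuple(int(n) for n in re.findall(r"\d+", version)[:3])


def installed_versions(names=("pymupdf", "pymupdf-layout", "pymupdf4llm")):
    """Map each distribution name to its installed version, or None if absent."""
    report = {}
    for name in names:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report


if __name__ == "__main__":
    versions = installed_versions()
    for name, version in versions.items():
        print(f"{name}: {version or 'not installed'}")
    pymupdf_version = versions.get("pymupdf")
    # 1.27.1 is the version that resolved the issue in this thread.
    if pymupdf_version and version_tuple(pymupdf_version) < version_tuple("1.27.1"):
        print("pymupdf is older than 1.27.1; upgrading may be worth trying.")
```

Pasting the printed lines into a bug report answers the version question in one go.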


Super - great to hear!