Problem with pymupdf4llm.to_markdown

Lauro · March 16, 2026, 9:43am

While I have no problem extracting the text from a rectangular clip area with pymupdf, I cannot do the same with pymupdf4llm, as suggested in The PyMuPDF4LLM API - PyMuPDF documentation.

This works for me:
import fitz
rect = fitz.Rect(33, 52, 366, 585)
doc = fitz.open(pdf_file)
…
and for every page:
tp = page.get_textpage(clip=rect)

I cannot do a similar thing with pymupdf4llm, because no clipping is done:
import pymupdf
import pymupdf4llm
pymupdf4llm.use_layout=False
testo_estratto = pymupdf4llm.to_markdown("Treatise_Book_1.pdf", margins=(33, 52, 366, 585))

Do you have any suggestions?
Thanks Lauro

HaraldLieder · March 16, 2026, 10:11am

Welcome here, @Lauro !

The margins parameter is supported in non-layout mode only. It specifies the width/height of the borders on a page that should be disregarded. So, looking at your example, the values 366 and 585 are definitely incorrect.
Using margins=(72, 72, 72, 72) would ignore 1-inch borders at all four edges.
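
If you already have a clip rectangle, it can be converted into border widths. The following is only a minimal sketch: the helper name `clip_to_margins` is made up for illustration, the page size (A4, 595 x 842 points) is an assumption about your file, and it assumes margins are given in the order (left, top, right, bottom).

```python
def clip_to_margins(page_rect, clip_rect):
    """Convert a clip rectangle into border widths (left, top, right, bottom).

    Both arguments are (x0, y0, x1, y1) sequences in points.
    Hypothetical helper, not part of pymupdf or pymupdf4llm.
    """
    px0, py0, px1, py1 = page_rect
    cx0, cy0, cx1, cy1 = clip_rect
    margins = (cx0 - px0, cy0 - py0, px1 - cx1, py1 - cy1)
    if min(margins) < 0:
        # a negative border width means the clip sticks out of the page
        raise ValueError("clip rectangle must lie inside the page rectangle")
    return margins


# Assumed A4 page; with a real document you would take the values from page.rect.
page_rect = (0, 0, 595, 842)
clip_rect = (33, 52, 366, 585)
print(clip_to_margins(page_rect, clip_rect))  # → (33, 52, 229, 257)
```

The resulting tuple, not the rectangle coordinates, is what would be passed as margins in non-layout mode.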

For layout mode, the only solution I can think of is fiddling with the CropBox of the pages: before each page is processed, do page.set_cropbox(page.rect + (72, 72, -72, -72)).
Roughly this approach:

import pymupdf4llm
import pymupdf
from pathlib import Path

doc = pymupdf.open("input.pdf")
md_text = ""
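margins = (72, 72, -72, -72)  # shrink the visible area by 1 inch (72 pt) on every side

for page in doc:
    # restrict the visible page area before the page is processed
    page.set_cropbox(page.rect + margins)
    # convert only this page; add further to_markdown options as needed
    md_text += pymupdf4llm.to_markdown(doc, pages=[page.number])

# write the Markdown next to the input file
Path(doc.name).with_suffix(".md").write_text(md_text)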

We (temporarily) modify each page here. This is necessary because the Page object is passed to the Layout plugin.

As a consequence, this approach can only work for PDF documents.
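
If the page change should really stay temporary, one option is to restore the original CropBox after each page. This is a rough sketch, not something pymupdf4llm provides: the name `temporary_cropbox` is hypothetical, and it assumes the page object exposes a `cropbox` attribute and a `set_cropbox()` method as PDF pages in pymupdf do, and that the value read from `cropbox` can be passed back to `set_cropbox()` unchanged.

```python
from contextlib import contextmanager


@contextmanager
def temporary_cropbox(page, left=72, top=72, right=72, bottom=72):
    """Shrink the page's CropBox inside the with-block, then restore it.

    `page` is assumed to behave like a pymupdf Page of a PDF document.
    Hypothetical helper for illustration only.
    """
    x0, y0, x1, y1 = tuple(page.cropbox)  # remember the original visible area
    try:
        page.set_cropbox((x0 + left, y0 + top, x1 - right, y1 - bottom))
        yield page
    finally:
        page.set_cropbox((x0, y0, x1, y1))  # always undo the change


# Possible use inside a per-page loop (doc and md_text defined elsewhere):
# for page in doc:
#     with temporary_cropbox(page):
#         md_text += pymupdf4llm.to_markdown(doc, pages=[page.number])
```

The try/finally makes sure the CropBox is reset even if the conversion of a page raises an exception.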

Lauro · March 17, 2026, 5:24pm

Thank you very much!
I misinterpreted the specs…
Lauro
