Does pymupdf4llm.to_markdown automatically use OCR?

Matteo_Doni · August 14, 2025, 2:39pm

Hi everyone,

I have enabled OCR as explained here:
https://pymupdf.readthedocs.io/en/latest/recipes-ocr.html

I’m using the following code to extract information from a PDF:

md_text_pymupdf = pymupdf4llm.to_markdown(
    doc=doc,
    page_chunks=True,
    write_images=False,
    embed_images=True,
    show_progress=False
)

My question is:

Does pymupdf4llm.to_markdown automatically apply OCR when processing the document?
Or do I need to iterate through each page and run OCR manually before calling to_markdown?
Or should I extract every image from the PDF and run Tesseract OCR on them separately?

Thanks!

Jamie_Lemon · August 14, 2025, 3:16pm

I think you’d need to iterate through each page and get a TextPage object (see the Page section of the PyMuPDF 1.26.3 documentation), then create a new PDF from the TextPage objects. Then use that new document as input to PyMuPDF4LLM.

@HaraldLieder Is that about right or is there a better way?

HaraldLieder · August 14, 2025, 4:11pm

Yes!

There is no automatic OCR; you have to do that on your own.
It is best to OCR the file before running pymupdf4llm on it.

You could do this as the preliminary step:

import pymupdf
import pymupdf4llm

doc = pymupdf.open("input.pdf")
new = pymupdf.open()  # empty PDF that will receive the OCR-ed pages
for page in doc:
    pix = page.get_pixmap(dpi=150)  # render the page as an image
    pdfdata = pix.pdfocr_tobytes()  # OCR the image into a 1-page PDF (needs Tesseract)
    temp = pymupdf.open("pdf", pdfdata)
    new.insert_pdf(temp)
    temp.close()
# Document "new" is the OCR-ed PDF. Go ahead using it:
md = pymupdf4llm.to_markdown(new, ...)
...
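
If only some pages are scanned, a possible variation is to OCR just the pages that have little or no extractable text and copy the rest unchanged. The following is a minimal sketch, not an official recipe: the helper names `needs_ocr` and `ocr_where_needed` and the `min_chars` threshold are hypothetical, and the "fewer than 20 characters means scanned" rule is an assumption you would need to tune for your documents.

```python
def needs_ocr(page_text: str, min_chars: int = 20) -> bool:
    """Heuristic (an assumption, not PyMuPDF behavior): treat a page as
    scanned when its text layer has fewer than min_chars non-whitespace
    characters."""
    return len("".join(page_text.split())) < min_chars


def ocr_where_needed(path: str, dpi: int = 150):
    """Return a new PDF in which text-poor pages are replaced by OCR-ed copies."""
    import pymupdf  # imported lazily; OCR also requires a Tesseract installation

    src = pymupdf.open(path)
    out = pymupdf.open()  # empty output PDF
    for page in src:
        if needs_ocr(page.get_text()):
            # Render the page and let Tesseract build a one-page PDF with a text layer.
            pix = page.get_pixmap(dpi=dpi)
            ocr_pdf = pymupdf.open("pdf", pix.pdfocr_tobytes())
            out.insert_pdf(ocr_pdf)
            ocr_pdf.close()
        else:
            # The page already has extractable text: copy it unchanged.
            out.insert_pdf(src, from_page=page.number, to_page=page.number)
    src.close()
    return out
```

The returned document could then be passed to pymupdf4llm.to_markdown in place of "new" above. Skipping pages that already have text should save OCR time and keeps their original text layer, but a page that mixes real text with scanned images would be missed by this heuristic.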
