Does pymupdf4llm.to_markdown automatically use OCR?

Hi everyone,

I have enabled OCR as explained here:
https://pymupdf.readthedocs.io/en/latest/recipes-ocr.html

I’m using the following code to extract information from a PDF:

md_text_pymupdf = pymupdf4llm.to_markdown(
    doc=doc,
    page_chunks=True,
    write_images=False,
    embed_images=True,
    show_progress=False
)

My questions are:

  • Does pymupdf4llm.to_markdown automatically apply OCR when processing the document?

  • Or do I need to iterate through each page and run OCR manually before calling to_markdown?

  • Or should I extract every image from the PDF and run Tesseract OCR on them separately?

Thanks!

I think you’d need to iterate through each page and get a TextPage object (see the Page chapter of the PyMuPDF 1.26.3 documentation), then create a new PDF from the TextPage objects and use that new document as input to PyMuPDF4LLM.

@HaraldLieder Is that about right or is there a better way?



Yes!

  1. There is no automatic OCR; you have to do that yourself.
  2. It is best to OCR the file before passing it to pymupdf4llm.

You could do this as a preliminary step:

import pymupdf
import pymupdf4llm

doc = pymupdf.open("input.pdf")
new = pymupdf.open()  # empty PDF that will receive the OCR-ed pages
for page in doc:
    pix = page.get_pixmap(dpi=150)  # render the page as an image
    pdfdata = pix.pdfocr_tobytes()  # 1-page PDF with an OCR text layer (needs Tesseract)
    temp = pymupdf.open("pdf", pdfdata)
    new.insert_pdf(temp)
    temp.close()
# Document "new" is the OCR-ed PDF. Go ahead using it:
md = pymupdf4llm.to_markdown(new, ...)
...
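
If only some pages are scans, a possible variation of the loop above is to OCR just the pages that have little or no extractable text and copy the others unchanged. This is only a sketch: the helper names needs_ocr and ocr_where_needed and the 20-character threshold are made up for illustration and are not part of PyMuPDF or PyMuPDF4LLM. It assumes Tesseract is set up as in the OCR recipe linked in the question.

```python
def needs_ocr(page_text: str, min_chars: int = 20) -> bool:
    """Heuristic (assumption): a page with almost no text is probably a scan."""
    return len(page_text.strip()) < min_chars


def ocr_where_needed(path: str, dpi: int = 150):
    """Return a new PyMuPDF document in which text-poor pages are OCR-ed."""
    import pymupdf  # imported here so needs_ocr() works without PyMuPDF

    src = pymupdf.open(path)
    out = pymupdf.open()  # empty output PDF
    for page in src:
        if needs_ocr(page.get_text()):
            # Render the page and let Tesseract build a 1-page PDF with a text layer.
            pix = page.get_pixmap(dpi=dpi)
            ocr_pdf = pymupdf.open("pdf", pix.pdfocr_tobytes())
            out.insert_pdf(ocr_pdf)
            ocr_pdf.close()
        else:
            # The page already has a text layer: copy it as it is.
            out.insert_pdf(src, from_page=page.number, to_page=page.number)
    src.close()
    return out
```

The result of ocr_where_needed("input.pdf") can then be passed to pymupdf4llm.to_markdown in place of the original document. Copying text pages unchanged keeps their original text layer and avoids the cost of rendering and OCR for them, but the threshold will likely need tuning for your files.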