Hi everyone,
I have enabled OCR as explained here:
https://pymupdf.readthedocs.io/en/latest/recipes-ocr.html
I’m using the following code to extract information from a PDF:
md_text_pymupdf = pymupdf4llm.to_markdown(
    doc=doc,
    page_chunks=True,
    write_images=False,
    embed_images=True,
    show_progress=False
)
My question is:
- Does pymupdf4llm.to_markdown automatically apply OCR when processing the document?
- Or do I need to iterate through each page and run OCR manually before calling to_markdown?
- Or should I extract every image from the PDF and run Tesseract OCR on them separately?
Thanks!
I think you’d need to iterate through each page and get a TextPage object (see the Page class in the PyMuPDF 1.26.3 documentation), then create a new PDF from the TextPage objects. Then use that new document as input to PyMuPDF4LLM.
@HaraldLieder Is that about right or is there a better way?
Yes!
- There is no automatic OCR; you have to do that on your own.
- It is best to OCR the file before running pymupdf4llm on it.
You could do this as the preliminary step:
import pymupdf
import pymupdf4llm

doc = pymupdf.open("input.pdf")
new = pymupdf.open()  # will contain the OCR-ed PDF
for page in doc:
    pix = page.get_pixmap(dpi=150)  # render the page as an image
    pdfdata = pix.pdfocr_tobytes()  # OCR it into a 1-page PDF with a text layer
    temp = pymupdf.open("pdf", pdfdata)
    new.insert_pdf(temp)  # append the OCR-ed page
    temp.close()
# Document "new" is the OCR-ed PDF. Go ahead using it:
md = pymupdf4llm.to_markdown(new, ...)
...
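If only some pages are scans, a possible variation is to OCR just the pages that have little or no extractable text and copy the others unchanged. This is only a rough sketch, not something built into PyMuPDF: the names needs_ocr and ocr_where_needed and the min_chars threshold are made up for illustration, and the "fewer than 20 characters means scanned" rule is an assumption you would need to tune for your files.

```python
def needs_ocr(page_text: str, min_chars: int = 20) -> bool:
    """Return True when a page has too little extractable text.

    min_chars is a hypothetical threshold, not a PyMuPDF setting.
    """
    return len(page_text.strip()) < min_chars


def ocr_where_needed(path: str, dpi: int = 150, min_chars: int = 20):
    """Build a new PDF, OCR-ing only the pages that lack usable text."""
    # Imported here so that needs_ocr() can be used without PyMuPDF installed.
    import pymupdf

    doc = pymupdf.open(path)
    new = pymupdf.open()  # output document
    for page in doc:
        if needs_ocr(page.get_text(), min_chars):
            # Same steps as the loop above: render, OCR, append.
            pix = page.get_pixmap(dpi=dpi)
            temp = pymupdf.open("pdf", pix.pdfocr_tobytes())
            new.insert_pdf(temp)
            temp.close()
        else:
            # The page already has text: copy it unchanged.
            new.insert_pdf(doc, from_page=page.number, to_page=page.number)
    doc.close()
    return new
```

You would then pass the returned document to pymupdf4llm.to_markdown in place of "new". Note that a page mixing real text with scanned images would be copied as-is under this rule, so its image text would still be missed.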