Pymupdf4llm forcing re-OCR, on doc that has ocr_spans

digger250 · April 10, 2026, 5:39pm

This is a PDF on which OCR has already been done:

Whenever I run pymupdf4llm.to_markdown, it determines it needs to re-OCR it.

Here is the output of pymupdf4llm.helpers.utils.analyze_page:
```
{'covered': Rect(0.0, 0.0, 471.1199951171875, 705.8400268554688), 'img_joins': 1.0, 'img_area': 1.0, 'txt_joins': 0.0, 'txt_area': 0.0, 'vec_area': 0.0, 'vec_joins': 0.0, 'chars_total': 0, 'chars_bad': 0, 'ocr_spans': 81, 'img_var': 3535.4526310813867, 'img_edges': 5.880182575502953, 'vec_suspicious': 0, 'needs_ocr': True, 'reason': 'ocr_spans'}
```
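For anyone skimming that dict, here is a minimal sketch that restates it in words. It works on a plain dict with the same keys as the output above (the `covered` Rect is left out so the snippet runs without PyMuPDF). The function name and the reading of each field are assumptions inferred from the key names, not documented pymupdf4llm behavior.

```python
def summarize_analysis(info: dict) -> str:
    """Restate an analyze_page-style result dict as one sentence.

    The meaning attached to each key is an assumption based on its name.
    """
    parts = []
    if info.get("img_area", 0.0) >= 0.99:
        parts.append("page is fully covered by an image")
    if info.get("chars_total", 0) == 0:
        parts.append("no regular text characters counted")
    if info.get("ocr_spans", 0) > 0:
        parts.append(f"{info['ocr_spans']} existing OCR spans present")

    verdict = "OCR requested" if info.get("needs_ocr") else "OCR not requested"
    reason = info.get("reason")
    if reason:
        verdict += f" (reason: {reason})"
    return "; ".join(parts + [verdict])


# Values copied from the output above ('covered' omitted).
info = {
    "img_joins": 1.0, "img_area": 1.0, "txt_joins": 0.0, "txt_area": 0.0,
    "vec_area": 0.0, "vec_joins": 0.0, "chars_total": 0, "chars_bad": 0,
    "ocr_spans": 81, "img_var": 3535.4526310813867,
    "img_edges": 5.880182575502953, "vec_suspicious": 0,
    "needs_ocr": True, "reason": "ocr_spans",
}
print(summarize_analysis(info))
```

Read this way, the page looks like a full-page scan with a hidden text layer: the spans are detected, yet they are what triggers the OCR request.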

Why is it not just using these ocr_spans?

Jamie_Lemon · April 13, 2026, 2:47pm

@digger250 Welcome to the forum!
I think there might be a misunderstanding here: OCR-extracted text is not stored inside a PDF, so the PDF has not already been OCR’ed. Your PDF file still consists of large scanned images for each page. The analyze_page method is figuring out whether OCR is required, and it reports the number of discovered ocr_spans as the reason to use OCR.

digger250 · April 13, 2026, 3:13pm

ocr_spans, according to the code comment, are “text spans with ignored text”.

See pymupdf4llm/src/helpers/utils.py at af6b222894a5d0053e5fcf5195ecaf6155b33f4a · pymupdf/pymupdf4llm · GitHub

That comment seems to contradict your assertion that “OCR extracted text is not stored inside a PDF”.

I don’t understand why the code is ignoring those text spans rather than using them as the output. When I use fitz directly (get_text), it returns those spans. I would like pymupdf4llm to do the same.
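A minimal sketch of that direct route, assuming the goal is simply to reuse whatever text layer the PDF already carries. `page.get_text()` is the PyMuPDF call mentioned above; the helper names are made up for illustration, and the result is plain text per page, not the Markdown that pymupdf4llm would produce.

```python
def existing_text_per_page(doc) -> list[str]:
    """Return the already-embedded text of each page, without running OCR.

    `doc` is any iterable of page objects exposing get_text(),
    for example an opened pymupdf.Document.
    """
    return [page.get_text().strip() for page in doc]


def read_pdf_text(path: str) -> list[str]:
    """Open a PDF with PyMuPDF and collect its embedded text per page."""
    import pymupdf  # imported here so the helper above has no dependencies

    with pymupdf.open(path) as doc:
        return existing_text_per_page(doc)
```

Calling `read_pdf_text("some.pdf")` returns one string per page; an empty string means that page has no text layer at all.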

Jamie_Lemon · April 13, 2026, 3:26pm

My bad!
It appears I misunderstood this and, yes, there is embedded text in the spans.

If I do this:

import pymupdf
import pymupdf4llm

doc = pymupdf.open("Service to Single Adults.pdf")
md = pymupdf4llm.to_markdown(doc, use_ocr=False)  # or to_text
print(md)

I get text data as expected. However, I guess you are asking why OCR is suggested at all when we already have this text data. @HaraldLieder, are you able to explain this or shed some light on what is happening?

digger250 · April 13, 2026, 3:44pm

When I run that code, I only get markdown header indicators “#”, but no text other than that.
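One way to catch this symptom programmatically is to test whether the Markdown contains anything besides heading markers. The check below is a rough sketch with a hypothetical function name; it treats a line as empty if nothing remains after removing leading `#` characters and whitespace.

```python
def has_only_empty_headings(markdown: str) -> bool:
    """True if the text has non-blank lines and all of them are bare '#' markers."""
    lines = [ln.strip() for ln in markdown.splitlines() if ln.strip()]
    return bool(lines) and all(ln.lstrip("#").strip() == "" for ln in lines)
```

If this returns True for the output of `to_markdown(doc, use_ocr=False)`, the page text was dropped, and falling back to `page.get_text()` is one option.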

Jamie_Lemon · April 13, 2026, 4:04pm

@digger250 Interesting. Can you confirm your versions of PyMuPDF and PyMuPDF4LLM?

print(f"{pymupdf.version=}, {pymupdf4llm.version=}")

digger250 · April 13, 2026, 4:30pm

print(f"{pymupdf.version=}, {pymupdf4llm.version=}")
pymupdf.version=('1.27.2.2', '1.27.2', None), pymupdf4llm.version='1.27.2.2'

HaraldLieder · April 14, 2026, 12:17am

This has to do with the current default settings, which, in the case of old OCR text, assume that you want to re-OCR the page.
You should be able to control this by using the parameter use_ocr=2. That parameter is now an integer implemented as a Python Enum object.

digger250 · April 16, 2026, 3:17pm

Thanks @HaraldLieder. Unfortunately, setting that to 2 causes OCR (Tesseract by default) to be invoked, which is what I’m trying to avoid.

HaraldLieder · April 17, 2026, 1:02am

Hm, ok, thanks for the info. We obviously need to adjust this piece of logic!
