PyMuPDF4LLM forces re-OCR on a document that already has ocr_spans

This is a PDF on which OCR has already been performed.

Whenever I run pymupdf4llm.to_markdown, it decides that the page needs to be OCR’ed again.

Here is the output of pymupdf4llm.helpers.utils.analyze_page:
```
{'covered': Rect(0.0, 0.0, 471.1199951171875, 705.8400268554688), 'img_joins': 1.0, 'img_area': 1.0, 'txt_joins': 0.0, 'txt_area': 0.0, 'vec_area': 0.0, 'vec_joins': 0.0, 'chars_total': 0, 'chars_bad': 0, 'ocr_spans': 81, 'img_var': 3535.4526310813867, 'img_edges': 5.880182575502953, 'vec_suspicious': 0, 'needs_ocr': True, 'reason': 'ocr_spans'}
```

Here are some of the ocr_spans:
Text: SERVICE TO SINGLE ADULTS—Age 3 | char_flags: 8 | stroked: False | filled: False
Text: A CASE STUDY* | char_flags: 8 | stroked: False | filled: False
Text: by | char_flags: 0 | stroked: False | filled: False
Text: CHAKLES GARVIN | char_flags: 0 | stroked: False | filled: False

Why is it not just using these ocr_spans?

@digger250 Welcome to the forum!
I think there might be a misunderstanding here: OCR-extracted text is not stored inside a PDF, so the PDF has not already been OCR’ed. Your PDF file still consists of large scanned images, one per page. The analyze_page method works out whether OCR is required, and it gives the number of ocr_spans it found as the reason to use OCR.

According to the code comment, ocr_spans means “text spans with ignored text”.

See pymupdf4llm/src/helpers/utils.py at commit af6b222894a5d0053e5fcf5195ecaf6155b33f4a in the pymupdf/pymupdf4llm repository on GitHub.

That comment seems to contradict your assertion that “OCR extracted text is not stored inside a PDF”.

I don’t understand why the code ignores those text spans rather than using them as the output. When I use fitz directly (get_text), it returns those spans. I would like pymupdf4llm to do the same.
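For reference, the direct extraction I mean looks roughly like the sketch below. `extract_plain_text` is a hypothetical helper name, not part of either library. It only assumes that each page object has a `get_text()` method, as PyMuPDF pages do. How usable the result is depends on the quality of the embedded OCR layer.

```python
from typing import Iterable, Protocol


class PageLike(Protocol):
    """Anything with a PyMuPDF-style get_text() method."""

    def get_text(self) -> str: ...


def extract_plain_text(pages: Iterable[PageLike], page_sep: str = "\n\n") -> str:
    """Join the existing text layer of every page, without running OCR."""
    chunks = []
    for page in pages:
        text = page.get_text().strip()
        if text:  # skip pages that have no text layer at all
            chunks.append(text)
    return page_sep.join(chunks)
```

A PyMuPDF document is iterable over its pages, so something like `extract_plain_text(pymupdf.open("Service to Single Adults.pdf"))` should return the embedded text as plain text, without any Markdown structure.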

My bad!
It appears I misunderstood this and, yes, there is embedded text in the spans.

If I do this:

```
import pymupdf
import pymupdf4llm

doc = pymupdf.open("Service to Single Adults.pdf")
md = pymupdf4llm.to_markdown(doc, use_ocr=False)  # or to_text
print(md)
```

I get text data as expected. However, I guess you are asking why OCR is suggested at all when this text data is already there. @HaraldLieder, are you able to explain this or shed some light on what is happening?

When I run that code, I only get markdown header indicators “#”, but no text other than that.
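One way to check for this in code is to test whether the Markdown contains anything other than heading markers, punctuation and whitespace. This is a minimal sketch; `markdown_is_empty` is a hypothetical helper, not something pymupdf4llm provides.

```python
import re


def markdown_is_empty(md: str) -> bool:
    """Return True if md holds no letters or digits, e.g. only '#' markers."""
    # \w matches letters, digits and underscore; underscores are removed first
    # so that bare emphasis markers such as "__" do not count as content.
    return re.search(r"\w", md.replace("_", "")) is None
```

For the output I am seeing, `markdown_is_empty(md)` would return True, which could be used as a signal to fall back to plain get_text.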

@digger250 Interesting. Can you confirm your versions of PyMuPDF and PyMuPDF4LLM?

print(f"{pymupdf.version=}, {pymupdf4llm.version=}")

print(f"{pymupdf.version=}, {pymupdf4llm.version=}")
pymupdf.version=('1.27.2.2', '1.27.2', None), pymupdf4llm.version='1.27.2.2'

This comes down to the current default settings, which assume that you want to re-OCR a page when it contains old OCR text.
You should be able to control this with the parameter use_ocr=2. That parameter is now an integer, implemented as a Python Enum.

Thanks @HaraldLieder. Unfortunately, setting that to 2 causes OCR (Tesseract by default) to be invoked, which is what I’m trying to avoid.

Hm, ok, thanks for the info. We obviously need to adjust this piece of logic!