pymupdf4llm forces re-OCR on a document that already has ocr_spans

I have a PDF on which OCR has already been run.

Whenever I run pymupdf4llm.to_markdown on it, the library decides that the page needs to be re-OCRed.

Here is the output of pymupdf4llm.helpers.utils.analyze_page:
```
{'covered': Rect(0.0, 0.0, 471.1199951171875, 705.8400268554688), 'img_joins': 1.0, 'img_area': 1.0, 'txt_joins': 0.0, 'txt_area': 0.0, 'vec_area': 0.0, 'vec_joins': 0.0, 'chars_total': 0, 'chars_bad': 0, 'ocr_spans': 81, 'img_var': 3535.4526310813867, 'img_edges': 5.880182575502953, 'vec_suspicious': 0, 'needs_ocr': True, 'reason': 'ocr_spans'}
```
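For anyone comparing their own pages, the fields that matter in that result can be pulled out with a few lines of plain Python. This is only a sketch: `explain_analysis` is a hypothetical helper, not part of pymupdf4llm, and it just echoes selected keys without assuming anything about how the library computes them.

```python
def explain_analysis(result: dict) -> str:
    """Summarize an analyze_page-style result dict (hypothetical helper)."""
    parts = [
        f"needs_ocr={result.get('needs_ocr')}",
        f"reason={result.get('reason')}",
        f"ocr_spans={result.get('ocr_spans', 0)}",
        f"chars_total={result.get('chars_total', 0)}",
        f"img_area={result.get('img_area', 0.0)}",
    ]
    return ", ".join(parts)


# Values copied from the output above (other keys omitted for brevity).
result = {
    "img_area": 1.0,
    "txt_area": 0.0,
    "chars_total": 0,
    "ocr_spans": 81,
    "needs_ocr": True,
    "reason": "ocr_spans",
}
summary = explain_analysis(result)
print(summary)
```

The summary puts the contradiction side by side: 81 ocr_spans were found, yet chars_total is 0 and the stated reason for OCR is those same spans.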

Here are some of the ocr_spans:
Text: SERVICE TO SINGLE ADULTS—Age 3 | char_flags: 8 | stroked: False | filled: False
Text: A CASE STUDY* | char_flags: 8 | stroked: False | filled: False
Text: by | char_flags: 0 | stroked: False | filled: False
Text: CHAKLES GARVIN | char_flags: 0 | stroked: False | filled: False

Why is it not just using these ocr_spans?

@digger250 Welcome to the forum!
I think there might be a misunderstanding here: OCR-extracted text is not stored inside a PDF, so the PDF has not already been OCRed. Your PDF file still consists of a large scanned image for each page. The analyze_page method works out whether OCR is required, and it reports the number of discovered ocr_spans as the reason to use OCR.

According to the code comment, ocr_spans are “text spans with ignored text”.

See pymupdf4llm/src/helpers/utils.py at af6b222894a5d0053e5fcf5195ecaf6155b33f4a · pymupdf/pymupdf4llm · GitHub

That comment seems to contradict your assertion that “OCR extracted text is not stored inside a PDF”.

I don’t understand why the code is ignoring those text spans rather than using them as the output. When I use fitz directly (get_text), it returns those spans. I would like pymupdf4llm to do the same.
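For reference, a minimal sketch of how spans can be collected from the nested dictionary that `page.get_text("dict")` returns (blocks, then lines, then spans). The `iter_spans` helper is hypothetical, and the sample dictionary below is a hand-built stand-in for a real page, populated with two of the spans quoted earlier.

```python
def iter_spans(page_dict: dict):
    """Yield every text span from a page.get_text("dict")-style result."""
    for block in page_dict.get("blocks", []):
        # Image blocks carry no "lines" key, so they are skipped here.
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                yield span


# Hand-built stand-in for page.get_text("dict") on the scanned page.
sample = {
    "blocks": [
        {"type": 1},  # image block (the scanned page itself)
        {
            "type": 0,  # text block
            "lines": [
                {"spans": [{"text": "A CASE STUDY*", "char_flags": 8}]},
                {"spans": [{"text": "by", "char_flags": 0}]},
            ],
        },
    ]
}
texts = [span["text"] for span in iter_spans(sample)]
print(texts)
```

With a real file, the same helper would be fed `pymupdf.open(path)[0].get_text("dict")` in place of `sample`.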

My bad!
It appears I misunderstood this and, yes, there is embedded text in the spans.

If I do this:

import pymupdf
import pymupdf4llm

doc = pymupdf.open("Service to Single Adults.pdf")
md = pymupdf4llm.to_markdown(doc, use_ocr=False)  # or to_text
print(md)

I get text data as expected. However, I guess you are asking why OCR is suggested at all when this text data is already there. @HaraldLieder, are you able to explain this or shed some light on what is happening?

When I run that code, I only get markdown header indicators “#”, but no text other than that.
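A quick way to check for this symptom programmatically is to test whether the returned markdown contains anything besides markup characters. This is a sketch only: `has_body_text` is a hypothetical helper, and treating `#`, `*`, `_` and whitespace as the only markup is a simplifying assumption.

```python
import re


def has_body_text(markdown: str) -> bool:
    """Return True if any line holds characters other than basic markup."""
    for line in markdown.splitlines():
        # Drop heading markers, emphasis markers, and whitespace.
        if re.sub(r"[#*_\s]", "", line):
            return True
    return False


# Output that only has header indicators, as described above.
print(has_body_text("# \n\n## \n"))
# Output that carries real text.
print(has_body_text("# A CASE STUDY\n\nby\n"))
```

Running this check on the result of to_markdown makes it easy to tell an empty extraction apart from a page that simply has little text.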

@digger250 Interesting. Can you confirm your versions of PyMuPDF and PyMuPDF4LLM?

print(f"{pymupdf.version=}, {pymupdf4llm.version=}")

print(f"{pymupdf.version=}, {pymupdf4llm.version=}")
pymupdf.version=('1.27.2.2', '1.27.2', None), pymupdf4llm.version='1.27.2.2'

This has to do with the current default settings, which assume that you want to re-OCR any page that contains old OCR text.
You should be able to control this by using the parameter use_ocr=2. That parameter is now an integer implemented as a Python Enum object: