@digger250 Welcome to the forum!
I think there might be a misunderstanding here: OCR-extracted text is not stored inside a PDF, so the PDF had not already been OCR’ed. Your PDF file still consists of large scanned images for each page. The analyze_page method is working out whether OCR is required, and it treats the number of discovered ocr_spans as the reason to use OCR.
That comment seems to contradict your assertion that “OCR extracted text is not stored inside a PDF”.
I don’t understand why the code is ignoring those text spans rather than using them as the output. When I use fitz directly (get_text), it returns those spans. I would like pymupdf4llm to do the same.
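For anyone wanting to confirm that a scanned PDF really carries embedded text spans, here is a minimal sketch using PyMuPDF directly. The helper names count_text_spans and report_embedded_text are hypothetical, made up for illustration; the sketch only assumes the usual layout of page.get_text("dict") (blocks, then lines, then spans, each span with a "text" key) and a PyMuPDF version that can be imported as pymupdf.

```python
def count_text_spans(page_dict):
    """Count non-empty text spans in a dict shaped like page.get_text("dict")."""
    count = 0
    for block in page_dict.get("blocks", []):
        # Image blocks have no "lines" key, so they contribute nothing here.
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                if span.get("text", "").strip():
                    count += 1
    return count


def report_embedded_text(path):
    """Print the number of embedded text spans found on each page of a PDF."""
    import pymupdf  # imported lazily so count_text_spans works without it

    with pymupdf.open(path) as doc:
        for page in doc:
            n = count_text_spans(page.get_text("dict"))
            print(f"page {page.number}: {n} text span(s)")
```

Calling report_embedded_text("Service to Single Adults.pdf") should then show whether each page has any text layer at all. A count of zero on every page would suggest a pure image scan, while non-zero counts suggest text left behind by an earlier OCR run.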
My bad!
It appears I misunderstood this and, yes, there is embedded text in the spans.
If I do this:
import pymupdf
import pymupdf4llm
doc = pymupdf.open("Service to Single Adults.pdf")
md = pymupdf4llm.to_markdown(doc, use_ocr=False)  # or to_text
print(md)
I get text data as expected. However, I guess you are asking why OCR is suggested at all when we already have this text data. @HaraldLieder, are you able to explain this or shed some light on what is happening?
This has to do with the current default settings: when a page contains text from an earlier OCR run, the default assumes that you want to re-OCR the page.
You should be able to control this by using the parameter use_ocr=2. That parameter is now an integer implemented as a Python Enum object: