OCR disabled because OpenCV not installed

I’m trying to extract clean text from unstructured PDFs using PyMuPDF4LLM (with PyMuPDF Layout) in Python in a Databricks environment, and I keep getting the message “OCR disabled because OpenCV not installed”.

But it is installed! I’ve tried both the normal and headless packages, and I just can’t win!

It does perform an extraction, but I would like to use the full functionality if it’s available.

Any suggestions? Here’s my code:

import numpy as np
import fitz
import pymupdf.layout
import pymupdf4llm
import cv2

print(pymupdf.__doc__)
print(cv2.__doc__)
print(np.__version__)

doc = pymupdf.open("/path/to/file.pdf")

json = pymupdf4llm.to_json(doc)

print(json)

And this is my output:


PyMuPDF 1.26.6: Python bindings for the MuPDF 1.26.11 library (rebased implementation).
Python 3.12 running on linux (64-bit).


OpenCV Python binary extension loader

2.1.3
OCR disabled because OpenCV not installed.
{
 "filename": "</path/to/redacted.pdf>",
 "page_count": 2,
 "toc": [
  [
   1,
   "A Framework for .....  etc.


What version of PyMuPDF4LLM are you using? (Note also that in your code you don’t need the import fitz line or the cv2 import; they will be imported along with PyMuPDF Layout and PyMuPDF4LLM.)

Versions are the most recent installed via:

pip install PyMuPDF pymupdf-layout pymupdf4llm opencv-python

I had left those in from trying to debug, I’ll remove them from my code, thanks :slight_smile:

It’s now just:

import os
import pymupdf.layout # (fitz)
import pymupdf4llm


Okay, so in Python if you do:

print(f"{pymupdf.version=}, {pymupdf4llm.version=}")

You get:

pymupdf.version=('1.26.6', '1.26.11', None), pymupdf4llm.version='0.2.7'

Just want to make 100% sure. (Also, please note that we haven’t used the “fitz” name in the API for some time! See: PyMuPDF 1.24.3 and Farewell to “Fitz” | Artifex)

It gives this:

pymupdf.version=('1.26.6', '1.26.11', None), pymupdf4llm.version='0.2.7'
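
For anyone else who sees this message while OpenCV appears to be installed, a quick sanity check is to confirm that cv2 is importable from the same interpreter the notebook is actually running. The sketch below uses only the standard library; the helper name describe_module is made up for illustration, and it is an assumption (not something confirmed in this thread) that a stale environment is the cause.

```python
import importlib.metadata
import importlib.util
import sys


def describe_module(module_name, dist_names):
    """Report whether a module is importable here and which pip distributions provide it."""
    spec = importlib.util.find_spec(module_name)
    versions = {}
    for dist in dist_names:
        try:
            versions[dist] = importlib.metadata.version(dist)
        except importlib.metadata.PackageNotFoundError:
            versions[dist] = None  # this distribution is not installed in this environment
    return {
        "interpreter": sys.executable,
        "importable": spec is not None,
        "location": getattr(spec, "origin", None),
        "distributions": versions,
    }


# Check both OpenCV wheels mentioned in the thread (normal and headless).
print(describe_module("cv2", ["opencv-python", "opencv-python-headless"]))
```

If "importable" is False, or the interpreter path is not the environment that pip installed into, the packages are probably not visible to the running Python process yet.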

However, having restarted my cluster over the weekend (it timed out), it seems to work. I now get a list index out of range for the larger documents. Could this be a bug in my code or a limit of the extractor??

******Processing (pdf):  (Approx. 2101.0 words)******
    >Standard PDF detected.  Trying enhanced extraction path
        >Attempting enhanced extraction (PyMuPDF4LLM)
=== Document parser messages ===
Full-page OCR on page.number=0/1.
Full-page OCR on page.number=0/1.

Bigger documents return this:

******Processing (pdf):  (Approx. 171787.0 words)******
    >Standard PDF detected.  Trying enhanced extraction path
        >Attempting enhanced extraction (PyMuPDF4LLM)
=== Document parser messages ===
Full-page OCR on page.number=0/1.
Full-page OCR on page.number=0/1.
Full-page OCR on page.number=334/335.
Full-page OCR on page.number=353/354.
Full-page OCR on page.number=0/1.
Full-page OCR on page.number=0/1.
Full-page OCR on page.number=334/335.
Full-page OCR on page.number=353/354.
Full-page OCR on page.number=0/1.
Full-page OCR on page.number=0/1.
Full-page OCR on page.number=334/335.
Full-page OCR on page.number=353/354.
Full-page OCR on page.number=0/1.
Full-page OCR on page.number=0/1.
Full-page OCR on page.number=334/335.
Full-page OCR on page.number=353/354.
Full-page OCR on page.number=0/1.
Full-page OCR on page.number=0/1.
Full-page OCR on page.number=334/335.

        >WARNING: PyMuPDF4LLM Enhanced extraction failed with error: list index out of range. Falling back to standard PyMuPDF.

        >Attempting standard extraction (PyMuPDF)
        >Standard PDF extraction succeeded

    >Completed: pdf. Status: ERROR: Enhanced Extraction failed.  Fallback standard success. Steps: PyMuPDF4LLM_Error_Fallback | Error: list index out of range.  7 of 12 files.

@MW_UK Thanks for the info and for confirming your versions. I believe the “list index out of range” error in this context relates to a problem parsing table content during the enhanced extraction (with PyMuPDF Layout), which is why it then falls back to standard PyMuPDF.
So I don’t think it is an error in your code here.

Are you able to identify which PDFs the enhanced extraction fails on and the page number?

No worries, thanks for your help!

The PDF in question is: https://assets.publishing.service.gov.uk/media/603539438fa8f54816a78968/scho0909bqyv-e-e.pdf

Running the following code gives the following output & error message:


import pymupdf
import pymupdf.layout
import pymupdf4llm

doc = pymupdf.open("/dbfs/mnt/lab/unrestricted/FloodDX/corpora/corpus_1_randd/raw_docs/scho0909bqyv-e-e.pdf")

json = pymupdf4llm.to_markdown(doc, show_progress=True)

print(json)


Parsing 335 pages of '/dbfs/mnt/lab/unrestricted/FloodDX/corpora/corpus_1_randd/raw_docs/scho0909bqyv-e-e.pdf'...
100%|██████████| 335/335 [01:23<00:00,  3.99it/s]
=== Document parser messages ===
Full-page OCR on page.number=0/1.
Full-page OCR on page.number=0/1.
Full-page OCR on page.number=334/335.
Full-page OCR on page.number=353/354.
Full-page OCR on page.number=0/1.
Full-page OCR on page.number=0/1.
Full-page OCR on page.number=334/335.
Full-page OCR on page.number=353/354.
Full-page OCR on page.number=0/1.
Full-page OCR on page.number=0/1.
Full-page OCR on page.number=334/335.
Full-page OCR on page.number=353/354.
Full-page OCR on page.number=0/1.
Full-page OCR on page.number=0/1.
Full-page OCR on page.number=334/335.
Full-page OCR on page.number=353/354.
Full-page OCR on page.number=0/1.
Full-page OCR on page.number=0/1.
Full-page OCR on page.number=334/335.
Full-page OCR on page.number=353/354.
Full-page OCR on page.number=0/1.
Full-page OCR on page.number=0/1.
Full-page OCR on page.number=334/335.
Full-page OCR on page.number=353/354.
Full-page OCR on page.number=0/1.
Full-page OCR on page.number=0/1.
Full-page OCR on page.number=334/335.
Full-page OCR on page.number=353/354.
Full-page OCR on page.number=0/1.
Full-page OCR on page.number=0/1.
Full-page OCR on page.number=334/335.
Full-page OCR on page.number=353/354.
Full-page OCR on page.number=0/1.
Full-page OCR on page.number=0/1.
Full-page OCR on page.number=334/335.
Full-page OCR on page.number=0/1.
Full-page OCR on page.number=334/335.

Generating markdown text...
 55%|█████▍    | 184/335 [00:00<00:00, 7512.36it/s]




[Trace ID: 00-f12546ebec2a9d46a722b043f77be33d-45513a062f141b4d-00]
File <command-6632665700678967>, line 8
      3 import pymupdf4llm
      6 doc = pymupdf.open("/dbfs/mnt/lab/unrestricted/FloodDX/corpora/corpus_1_randd/raw_docs/scho0909bqyv-e-e.pdf")
----> 8 json = pymupdf4llm.to_markdown(doc, show_progress=True)
     10 print(json)


File /local_disk0/.ephemeral_nfs/envs/pythonEnv-b808ee37-f0d0-4f8b-aff2-28b2ace3fee3/lib/python3.12/site-packages/pymupdf4llm/__init__.py:97, in to_markdown(doc, header, footer, pages, write_images, embed_images, image_path, image_format, filename, force_text, page_chunks, page_separators, dpi, ocr_dpi, page_width, page_height, ignore_code, show_progress, use_ocr, **kwargs)
     82     raise ValueError("Cannot both write_images and embed_images")
     83 parsed_doc = parse_document(
     84     doc,
     85     filename=filename,
   (...)
     95     use_ocr=use_ocr,
     96 )
---> 97 return parsed_doc.to_markdown(
     98     header=header,
     99     footer=footer,
    100     write_images=write_images,
    101     embed_images=embed_images,
    102     ignore_code=ignore_code,
    103     show_progress=show_progress,
    104     page_separators=page_separators,
    105     page_chunks=page_chunks,
    106     use_ocr=use_ocr,
    107 )
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-b808ee37-f0d0-4f8b-aff2-28b2ace3fee3/lib/python3.12/site-packages/pymupdf4llm/helpers/document_layout.py:718, in ParsedDocument.to_markdown(self, header, footer, write_images, embed_images, ignore_code, show_progress, page_separators, page_chunks, **kwargs)
    716     md_string += list_item_to_md(box.textlines, list_item_levels[i])
    717 elif btype == "footnote":
--> 718     md_string += footnote_to_md(box.textlines)
    719 elif not header and btype == "page-header":
    720     continue
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-b808ee37-f0d0-4f8b-aff2-28b2ace3fee3/lib/python3.12/site-packages/pymupdf4llm/helpers/document_layout.py:480, in footnote_to_md(textlines)
    470 def footnote_to_md(textlines):
    471     """
    472     Convert "footnote" bboxes to markdown.
    473     The first line is prefixed with "> ". Subsequent lines are appended
   (...)
    478     one list item is contained in a single bbox.
    479     """
--> 480     line = textlines[0]
    481     spans = line["spans"]
    482     output = "> "



@MW_UK As expected, it is the page at index 184 in the document, which contains an interesting table, that causes the error.
I attach a version of your PDF with this page removed (just to validate the rest of the document) and a single-page PDF (guilty.pdf!) where we can quickly replicate the error.

This was my code:

import pymupdf.layout
import pymupdf4llm

print(f"{pymupdf.version=}, {pymupdf4llm.version=}")

doc = pymupdf.open("scho0909bqyv-e-e.pdf")

try:
    md = pymupdf4llm.to_markdown(
        doc,
        show_progress=True,
    )
    print(md)
except Exception as e:
    print(f"{e=}")
    input("Exception, press return to continue... ? ")
    # is there any way to continue the process and overlook this exception?

guilty.pdf (16.0 KB)

scho0909bqyv-e-e-x.pdf (7.9 MB)

@HaraldLieder Do you have any ideas about how we could improve this or how best to handle the exception? Is there a possibility of skipping a page in the event of an error? ( Perhaps it converts the output to a warning in the result. )

Nice one! I might see how many times this error occurs in the entire corpus - if it’s once or twice, I can hack a file-name triggered exception to strip out specific pages before extracting, now I know what to look for.
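
A minimal sketch of that file-name-triggered workaround, assuming the known-bad pages are kept in a small lookup table. KNOWN_BAD_PAGES, pages_to_process and to_markdown_skipping_known_bad are hypothetical names; the only library feature relied on is the pages argument of to_markdown, which appears in the traceback signature above. Whether leaving a page out of pages is enough to avoid the exception is an assumption to verify on guilty.pdf.

```python
import os

# Hypothetical skip list: file name -> set of 0-based page indices to leave out.
KNOWN_BAD_PAGES = {"scho0909bqyv-e-e.pdf": {184}}


def pages_to_process(path, page_count, known_bad=KNOWN_BAD_PAGES):
    """Return the 0-based page indices to extract, minus any known-bad ones."""
    skip = known_bad.get(os.path.basename(path), set())
    return [n for n in range(page_count) if n not in skip]


def to_markdown_skipping_known_bad(path):
    # Imported lazily so the helper above can be used without PyMuPDF installed.
    import pymupdf.layout
    import pymupdf4llm

    doc = pymupdf.open(path)
    pages = pages_to_process(path, doc.page_count)
    return pymupdf4llm.to_markdown(doc, pages=pages)
```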

If it’s more than 2-3 I’ll look into ways to ignore. Falling back to basic extraction would give me some text, but the rich headings I get in the MD are invaluable for targeted cleansing.

One thought - are there any hidden parameters to trigger an “ignore_tables” type-thing? Would skipping them altogether help? Just injecting a “removed table” type string?

Thanks again, you have been a massive help :slight_smile:

Hi Jamie, I suggest processing the file page by page. This lets you react to each page individually, e.g.:

doc = pymupdf.open("problem.pdf")
md = ""
for page in doc:
    try:
        md += pymupdf4llm.to_markdown(…, pages=[page.number], …)
    except Exception as e:
        print(f"Skipping problem {page.number=}: {e}")

print(md)

Thanks @HaraldLieder - great suggestion, so using this logic:

import pymupdf.layout
import pymupdf4llm

doc = pymupdf.open("scho0909bqyv-e-e.pdf")
md = ""

for page in doc:
  try:
    print(f"Processing page index: {page.number}")
    md += pymupdf4llm.to_markdown(doc, pages=[page.number])
  except Exception as e:
    print(f"Skipping problem at page index: {page.number}: {e}")
    md += f"Page index: {page.number} not processed"  # or whatever message

print(md)

We can faithfully skip problematic pages.

Brilliant, I have it working as a fall-back, so 4LLM > 4LLM_page_by_page > standard extract.
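
A minimal sketch of such a chain, for anyone wanting to reproduce it. The names first_success and build_extractors are hypothetical, and the "standard" step assumes plain page.get_text() is what the basic extraction amounts to; adapt as needed.

```python
def first_success(extractors):
    """Run (name, callable) pairs in order; return (name, text, errors) for the first that works."""
    errors = []
    for name, extract in extractors:
        try:
            return name, extract(), errors
        except Exception as exc:
            errors.append((name, str(exc)))  # remember why this step failed
    raise RuntimeError(f"all extractors failed: {errors}")


def build_extractors(path):
    # Imported lazily so first_success() can be exercised without PyMuPDF installed.
    import pymupdf.layout
    import pymupdf4llm

    doc = pymupdf.open(path)

    def whole_document():
        return pymupdf4llm.to_markdown(doc)

    def page_by_page():
        parts = []
        for page in doc:
            try:
                parts.append(pymupdf4llm.to_markdown(doc, pages=[page.number]))
            except Exception as exc:
                parts.append(f"Page index {page.number} not processed: {exc}\n")
        return "".join(parts)

    def standard():
        return "\n".join(page.get_text() for page in doc)

    return [
        ("4LLM", whole_document),
        ("4LLM_page_by_page", page_by_page),
        ("standard", standard),
    ]
```

Calling first_success(build_extractors("some.pdf")) returns which step produced the text, plus the errors from any earlier steps, which is handy for a status line like the one in the log above.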

Thank you so much! :folded_hands:

I’ve gone for a fallback because it helps manage console messages. Is there a built-in way to suppress the document parser messages without the additional complexity of redirecting to /dev/null?

Processing just 12 of my >1000 documents is generating >55,000 lines of output - most of which is some variation of “Full-page OCR on page.number=334/335.”

I don’t see anything like that in the API - PyMuPDF documentation

So I defer to @HaraldLieder to see what might be possible here.

Please consult the documentation: pymupdf.set_messages() allows you to direct messages to files or streams, etc. Functions - PyMuPDF documentation

Ah no, sorry: I’m doing a normal print() unfortunately. Will be changed in the next version.

I’ve just published a new version, 0.2.8, of PyMuPDF4LLM.
This includes a fix for this problem:
Output that previously went through a plain print() now uses pymupdf.message(...).
This means that you can suppress or redirect messages via pymupdf.set_messages(...).
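
With 0.2.8 or later, one way to cut the noise without discarding every message is to hand set_messages() a small filtering stream. This is a sketch under assumptions: LineFilter and install_filter are hypothetical names, the stream= keyword is used as described in the PyMuPDF Functions documentation (an object with write() and flush()), and the "Full-page OCR" substring is taken from the log output earlier in this thread.

```python
import sys


class LineFilter:
    """File-like object that drops lines containing any of the given substrings."""

    def __init__(self, target, drop_substrings):
        self.target = target
        self.drop_substrings = tuple(drop_substrings)
        self._buffer = ""

    def write(self, text):
        # Buffer partial writes so filtering always sees complete lines.
        self._buffer += text
        while "\n" in self._buffer:
            line, self._buffer = self._buffer.split("\n", 1)
            if not any(s in line for s in self.drop_substrings):
                self.target.write(line + "\n")
        return len(text)

    def flush(self):
        self.target.flush()


def install_filter():
    # Imported lazily so LineFilter can be tested without PyMuPDF installed.
    import pymupdf

    pymupdf.set_messages(stream=LineFilter(sys.stderr, ["Full-page OCR"]))
```

To silence everything instead, passing a path such as os.devnull to set_messages() should also work, but check the documentation for the exact keyword in your installed version.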

That’s fantastic! Thank you for the updates!

I have finished my first batch of extraction but will certainly rerun a sample to see how it works before starting on my larger batches.