OCR disabled because OpenCV not installed

MW_UK · December 12, 2025, 1:52pm

I’m trying to extract clean text from unstructured PDFs using PyMuPDF4LLM (and PyMuPDF Layout) with Python in a Databricks environment, and I keep getting the message “OCR disabled because OpenCV not installed”.

But it is installed! I’ve tried both the normal and headless packages, and I just can’t win!

It does perform an extraction, but I would like to use the full functionality if it’s available.

Any suggestions? Here’s my code:

import numpy as np
import fitz
import pymupdf.layout
import pymupdf4llm
import cv2

print(pymupdf.__doc__)
print(cv2.__doc__)
print(np.__version__)

doc = pymupdf.open("/path/to/file.pdf")

json = pymupdf4llm.to_json(doc)

print(json)

And this is my output:


PyMuPDF 1.26.6: Python bindings for the MuPDF 1.26.11 library (rebased implementation).
Python 3.12 running on linux (64-bit).


OpenCV Python binary extension loader

2.1.3
OCR disabled because OpenCV not installed.
{
 "filename": "</path/to/redacted.pdf>",
 "page_count": 2,
 "toc": [
  [
   1,
   "A Framework for .....  etc.

Jamie_Lemon · December 12, 2025, 4:31pm

What version of PyMuPDF4LLM are you using? (note also in your code you don’t need the import fitz line here or the cv2 import - they will be imported along with PyMuPDF Layout & PyMuPDF4LLM )

MW_UK · December 12, 2025, 4:42pm

Versions are the most recent installed via:

pip install PyMuPDF pymupdf-layout pymupdf4llm opencv-python

I had left those in from trying to debug, I’ll remove them from my code, thanks

It’s now just:

import os
import pymupdf.layout # (fitz)
import pymupdf4llm
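A quick way to confirm what the running interpreter can actually see is to ask Python directly, rather than relying on what pip reported. The sketch below uses only the standard library; the helper names `check_modules` and `installed_version` are made up for illustration. It checks whether a module can be found and which version of a distribution is installed, and it says nothing about how PyMuPDF4LLM itself detects OpenCV.

```python
import importlib.metadata
import importlib.util
import sys


def check_modules(names):
    """Return {module_name: bool} telling whether this interpreter can find each module."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


def installed_version(dist_name):
    """Return the installed version of a pip distribution, or None if it is absent."""
    try:
        return importlib.metadata.version(dist_name)
    except importlib.metadata.PackageNotFoundError:
        return None


# Which Python is running this cell, and what can it import?
print(sys.executable)
print(check_modules(["cv2", "pymupdf", "pymupdf4llm"]))
for dist in ("opencv-python", "opencv-python-headless", "pymupdf4llm"):
    print(dist, installed_version(dist))
```

If `cv2` shows as missing here even though pip installed it, the package probably went into a different environment than the one the notebook is using, or the Python process was started before the install.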

Jamie_Lemon · December 12, 2025, 5:35pm

Okay, so in Python if you do:

print(f"{pymupdf.version=}, {pymupdf4llm.version=}")

You get:

pymupdf.version=('1.26.6', '1.26.11', None), pymupdf4llm.version='0.2.7'

Just want to make 100% sure. (Also please note that we haven’t used the “fitz” name in the API for some time! - PyMuPDF 1.24.3 and Farewell to “Fitz” | Artifex )

MW_UK · December 15, 2025, 1:51pm

It gives this:

pymupdf.version=('1.26.6', '1.26.11', None), pymupdf4llm.version='0.2.7'

However, having restarted my cluster over the weekend (it timed out), it seems to work. I now get a list index out of range for the larger documents. Could this be a bug in my code or a limit of the extractor??

******Processing (pdf):  (Approx. 2101.0 words)******
    >Standard PDF detected.  Trying enhanced extraction path
        >Attempting enhanced extraction (PyMuPDF4LLM)
=== Document parser messages ===
Full-page OCR on page.number=0/1.
Full-page OCR on page.number=0/1.

Bigger documents return this:

******Processing (pdf):  (Approx. 171787.0 words)******
    >Standard PDF detected.  Trying enhanced extraction path
        >Attempting enhanced extraction (PyMuPDF4LLM)
=== Document parser messages ===
Full-page OCR on page.number=0/1.
Full-page OCR on page.number=0/1.
Full-page OCR on page.number=334/335.
Full-page OCR on page.number=353/354.
Full-page OCR on page.number=0/1.
Full-page OCR on page.number=0/1.
Full-page OCR on page.number=334/335.
Full-page OCR on page.number=353/354.
Full-page OCR on page.number=0/1.
Full-page OCR on page.number=0/1.
Full-page OCR on page.number=334/335.
Full-page OCR on page.number=353/354.
Full-page OCR on page.number=0/1.
Full-page OCR on page.number=0/1.
Full-page OCR on page.number=334/335.
Full-page OCR on page.number=353/354.
Full-page OCR on page.number=0/1.
Full-page OCR on page.number=0/1.
Full-page OCR on page.number=334/335.

        >WARNING: PyMuPDF4LLM Enhanced extraction failed with error: list index out of range. Falling back to standard PyMuPDF.

        >Attempting standard extraction (PyMuPDF)
        >Standard PDF extraction succeeded

    >Completed: pdf. Status: ERROR: Enhanced Extraction failed.  Fallback standard success. Steps: PyMuPDF4LLM_Error_Fallback | Error: list index out of range.  7 of 12 files.

Jamie_Lemon · December 15, 2025, 4:01pm

@MW_UK Thanks for the info and for confirming your versions. I believe the error list index out of range in this context relates to a problem with parsing table content and the enhanced extraction (with PyMuPDF Layout), so it then falls back to standard PyMuPDF.
So I don’t think it is an error in your code here.

Are you able to identify which PDFs the enhanced extraction fails on and the page number?

MW_UK · December 15, 2025, 4:49pm

No worries, thanks for your help!

The PDF in question is: https://assets.publishing.service.gov.uk/media/603539438fa8f54816a78968/scho0909bqyv-e-e.pdf

Running the following code gives the following output & error message:


import pymupdf
import pymupdf.layout
import pymupdf4llm

doc = pymupdf.open("/dbfs/mnt/lab/unrestricted/FloodDX/corpora/corpus_1_randd/raw_docs/scho0909bqyv-e-e.pdf")

json = pymupdf4llm.to_markdown(doc, show_progress=True)

print(json)


Parsing 335 pages of '/dbfs/mnt/lab/unrestricted/FloodDX/corpora/corpus_1_randd/raw_docs/scho0909bqyv-e-e.pdf'...
100%|██████████| 335/335 [01:23<00:00,  3.99it/s]
=== Document parser messages ===
Full-page OCR on page.number=0/1.
Full-page OCR on page.number=0/1.
Full-page OCR on page.number=334/335.
Full-page OCR on page.number=353/354.
Full-page OCR on page.number=0/1.
Full-page OCR on page.number=0/1.
Full-page OCR on page.number=334/335.
Full-page OCR on page.number=353/354.
Full-page OCR on page.number=0/1.
Full-page OCR on page.number=0/1.
Full-page OCR on page.number=334/335.
Full-page OCR on page.number=353/354.
Full-page OCR on page.number=0/1.
Full-page OCR on page.number=0/1.
Full-page OCR on page.number=334/335.
Full-page OCR on page.number=353/354.
Full-page OCR on page.number=0/1.
Full-page OCR on page.number=0/1.
Full-page OCR on page.number=334/335.
Full-page OCR on page.number=353/354.
Full-page OCR on page.number=0/1.
Full-page OCR on page.number=0/1.
Full-page OCR on page.number=334/335.
Full-page OCR on page.number=353/354.
Full-page OCR on page.number=0/1.
Full-page OCR on page.number=0/1.
Full-page OCR on page.number=334/335.
Full-page OCR on page.number=353/354.
Full-page OCR on page.number=0/1.
Full-page OCR on page.number=0/1.
Full-page OCR on page.number=334/335.
Full-page OCR on page.number=353/354.
Full-page OCR on page.number=0/1.
Full-page OCR on page.number=0/1.
Full-page OCR on page.number=334/335.
Full-page OCR on page.number=0/1.
Full-page OCR on page.number=334/335.

Generating markdown text...
 55%|█████▍    | 184/335 [00:00<00:00, 7512.36it/s]



[Trace ID: 00-f12546ebec2a9d46a722b043f77be33d-45513a062f141b4d-00]
File <command-6632665700678967>, line 8
      3 import pymupdf4llm
      6 doc = pymupdf.open("/dbfs/mnt/lab/unrestricted/FloodDX/corpora/corpus_1_randd/raw_docs/scho0909bqyv-e-e.pdf")
----> 8 json = pymupdf4llm.to_markdown(doc, show_progress=True)
     10 print(json)


File /local_disk0/.ephemeral_nfs/envs/pythonEnv-b808ee37-f0d0-4f8b-aff2-28b2ace3fee3/lib/python3.12/site-packages/pymupdf4llm/__init__.py:97, in to_markdown(doc, header, footer, pages, write_images, embed_images, image_path, image_format, filename, force_text, page_chunks, page_separators, dpi, ocr_dpi, page_width, page_height, ignore_code, show_progress, use_ocr, **kwargs)
     82     raise ValueError("Cannot both write_images and embed_images")
     83 parsed_doc = parse_document(
     84     doc,
     85     filename=filename,
   (...)
     95     use_ocr=use_ocr,
     96 )
---> 97 return parsed_doc.to_markdown(
     98     header=header,
     99     footer=footer,
    100     write_images=write_images,
    101     embed_images=embed_images,
    102     ignore_code=ignore_code,
    103     show_progress=show_progress,
    104     page_separators=page_separators,
    105     page_chunks=page_chunks,
    106     use_ocr=use_ocr,
    107 )
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-b808ee37-f0d0-4f8b-aff2-28b2ace3fee3/lib/python3.12/site-packages/pymupdf4llm/helpers/document_layout.py:718, in ParsedDocument.to_markdown(self, header, footer, write_images, embed_images, ignore_code, show_progress, page_separators, page_chunks, **kwargs)
    716     md_string += list_item_to_md(box.textlines, list_item_levels[i])
    717 elif btype == "footnote":
--> 718     md_string += footnote_to_md(box.textlines)
    719 elif not header and btype == "page-header":
    720     continue
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-b808ee37-f0d0-4f8b-aff2-28b2ace3fee3/lib/python3.12/site-packages/pymupdf4llm/helpers/document_layout.py:480, in footnote_to_md(textlines)
    470 def footnote_to_md(textlines):
    471     """
    472     Convert "footnote" bboxes to markdown.
    473     The first line is prefixed with "> ". Subsequent lines are appended
   (...)
    478     one list item is contained in a single bbox.
    479     """
--> 480     line = textlines[0]
    481     spans = line["spans"]
    482     output = "> "

Jamie_Lemon · December 15, 2025, 11:12pm

@MW_UK As expected, it is the page at index 184 in the document, which contains an interesting table, that causes the error.
I attach a version of your PDF with this page removed - just to validate the rest of the document - and then a single page PDF (guilty.pdf ! ) where we can quickly replicate the error.

This was my code:

import pymupdf.layout
import pymupdf4llm

print(f"{pymupdf.version=}, {pymupdf4llm.version=}")

doc = pymupdf.open("scho0909bqyv-e-e.pdf")

try:
  md = pymupdf4llm.to_markdown(
    doc,
    show_progress=True,
  )
  print(md)
except Exception as e:
  print(f"{e=}")
  input("Exception, press return to continue... ? ")
  # is there any way to continue the process and overlook this exception?

guilty.pdf (16.0 KB)

scho0909bqyv-e-e-x.pdf (7.9 MB)

@HaraldLieder Do you have any ideas about how we could improve this or how best to handle the exception? Is there a possibility of skipping a page in the event of an error? ( Perhaps it converts the output to a warning in the result. )

MW_UK · December 16, 2025, 11:37am

Nice one! I might see how many times this error occurs in the entire corpus - if it’s once or twice, I can hack a file-name triggered exception to strip out specific pages before extracting, now I know what to look for.

If it’s more than 2-3 I’ll look into ways to ignore. Falling back to basic extraction would give me some text, but the rich headings I get in the MD are invaluable for targeted cleansing.

One thought - are there any hidden parameters to trigger an “ignore_tables” type-thing? Would skipping them altogether help? Just injecting a “removed table” type string?

Thanks again, you have been a massive help
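The "file-name triggered exception" idea could be sketched as a small lookup table of known-bad pages. Everything here is illustrative: `KNOWN_BAD_PAGES` and `pages_to_process` are hypothetical names, and the only entry is the page index 184 identified earlier in this thread.

```python
from pathlib import Path

# Hypothetical lookup table: file name -> 0-based page indices known to break extraction.
KNOWN_BAD_PAGES = {
    "scho0909bqyv-e-e.pdf": {184},
}


def pages_to_process(pdf_path, page_count, known_bad=KNOWN_BAD_PAGES):
    """Return the 0-based page indices to extract, leaving out the known-bad ones."""
    skip = known_bad.get(Path(pdf_path).name, set())
    return [n for n in range(page_count) if n not in skip]


pages = pages_to_process("/some/dir/scho0909bqyv-e-e.pdf", page_count=335)
print(len(pages), 184 in pages)
```

The resulting list could then be handed to the `pages` parameter that appears in the `to_markdown` signature in the traceback above. This only helps for pages that are already known to fail, so it suits the "once or twice in the corpus" case.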

HaraldLieder · December 16, 2025, 12:27pm

Hi Jamie, I suggest processing the file page by page. This lets you react to each page individually, e.g.:

doc = pymupdf.open("problem.pdf")
md = ""
for page in doc:
    try:
        md += pymupdf4llm.to_markdown(…, pages=[page.number], …)
    except Exception as e:
        print(f"Skipping problem {page.number=}: {e}")

print(md)

Jamie_Lemon · December 16, 2025, 12:55pm

Thanks @HaraldLieder - great suggestion, so using this logic:

import pymupdf.layout
import pymupdf4llm

doc = pymupdf.open("scho0909bqyv-e-e.pdf")
md = ""

for page in doc:
  try:
    print(f"Processing page index: {page.number}")
    md += pymupdf4llm.to_markdown(doc, pages=[page.number])
  except Exception as e:
    print(f"Skipping problem at page index: {page.number}: {e}")
    md += f"Page index: {page.number} not processed" # or whatever message

print(md)

We can faithfully skip problematic pages.
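When running this over a whole corpus, it may also help to record which pages were skipped and why, so they can be reviewed later. The following is a minimal sketch of that idea, not part of PyMuPDF4LLM: `convert_page_by_page` is a hypothetical helper that takes the per-page converter as a callable, and `fake_convert` is a stand-in so the example runs without any PDF.

```python
def convert_page_by_page(page_count, convert_page,
                         placeholder="\n\n[Page index {n} not processed]\n\n"):
    """Convert pages one at a time.

    Returns (markdown, failed), where failed maps a page index to its error text.
    """
    parts = []
    failed = {}
    for n in range(page_count):
        try:
            parts.append(convert_page(n))
        except Exception as e:  # broad on purpose: any failure skips only this page
            failed[n] = f"{type(e).__name__}: {e}"
            parts.append(placeholder.format(n=n))
    return "".join(parts), failed


# Stand-in converter for demonstration only; it fails on page index 2.
def fake_convert(n):
    if n == 2:
        raise IndexError("list index out of range")
    return f"page {n}\n"


md, failed = convert_page_by_page(4, fake_convert)
print(failed)
```

With the real library, `convert_page` could be something like `lambda n: pymupdf4llm.to_markdown(doc, pages=[n])`, mirroring the call in the loop above.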

MW_UK · December 16, 2025, 7:59pm

Brilliant, I have it working as a fall-back, so 4LLM > 4LLM_page_by_page > standard extract.

Thank you so much!

I’ve gone for a fallback because it helps manage console messages. Is there a built-in way to suppress the document parser messages without the additional complexity of redirecting to /dev/null?

Processing just 12 of my >1000 documents is generating >55,000 lines of output - most of which is some variation of “Full-page OCR on page.number=334/335.”
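The fallback order described at the top of this post (4LLM > 4LLM_page_by_page > standard extract) can be expressed as a generic "first strategy that succeeds" loop. This is only a sketch of the pattern, not the actual pipeline: `first_success` and the three stand-in functions are hypothetical, and the stand-ins simply simulate one failure followed by a success.

```python
def first_success(strategies):
    """Try (name, callable) pairs in order.

    Returns (name, result, errors) for the first callable that does not raise;
    errors lists the (name, message) pairs of the strategies that failed before it.
    """
    errors = []
    for name, func in strategies:
        try:
            return name, func(), errors
        except Exception as e:
            errors.append((name, f"{type(e).__name__}: {e}"))
    raise RuntimeError(f"All strategies failed: {errors}")


# Stand-ins for the three extraction paths, for demonstration only.
def enhanced():
    raise IndexError("list index out of range")


def page_by_page():
    return "markdown with a placeholder for the bad page"


def standard():
    return "plain text"


name, result, errors = first_success([
    ("PyMuPDF4LLM", enhanced),
    ("PyMuPDF4LLM_page_by_page", page_by_page),
    ("standard", standard),
])
print(name, errors)
```

Returning the collected errors alongside the result makes it possible to log one summary line per document instead of printing every failure as it happens.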

Jamie_Lemon · December 16, 2025, 10:04pm

I don’t see anything like that in the API - PyMuPDF documentation

So I defer to @HaraldLieder to see what might be possible here.

HaraldLieder · December 20, 2025, 9:29am

Please consult the documentation: pymupdf.set_messages() allows you to direct messages to files or streams, etc. Functions - PyMuPDF documentation

HaraldLieder · December 20, 2025, 9:33am

Ah no, sorry: I’m doing a normal print() unfortunately. Will be changed in the next version.
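Since the messages in this version come from a plain `print()`, one workaround that stays inside Python is `contextlib.redirect_stdout` from the standard library. The sketch below is an illustration under that assumption; `run_quietly` and `noisy` are hypothetical names, and `noisy` merely imitates a function that prints a parser message.

```python
import contextlib
import io


def run_quietly(func, *args, **kwargs):
    """Call func while capturing anything it print()s; return (result, captured_text)."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        result = func(*args, **kwargs)
    return result, buffer.getvalue()


# Stand-in for a chatty extraction call, for demonstration only.
def noisy():
    print("Full-page OCR on page.number=0/1.")
    return "markdown text"


result, captured = run_quietly(noisy)
print(len(captured.splitlines()), "message line(s) captured")
```

This only intercepts text written through Python's `sys.stdout`; output written to stderr or directly by native code is not affected.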

HaraldLieder · January 4, 2026, 5:48pm

I’ve just published a new version 0.2.8 for PyMuPDF4LLM.
This includes a fix for this problem:
I am now using pymupdf.message(...) for output that previously went through a simple print().
This means that you can suppress or re-direct messages via pymupdf.set_messages(...).

MW_UK · January 6, 2026, 12:09pm

That’s fantastic! Thank you for the updates!

I have finished my first batch of extraction but will certainly rerun a sample to see how it works before starting on my larger batches.
