BUG: pymupdf4llm parse_document() produces KeyError when document.use_ocr is False

Callum_Davidson · November 24, 2025, 11:37pm

Hi,
I raised this as an issue on the GitHub page before I saw the pinned notice to post here. Apologies. Below is what I posted there:

Parsing a document results in a KeyError when OCR is disabled. See the traceback below, from the parse_document function in helpers/document_layout.py.

     849 else:
     850     decision = {"should_ocr": False}
--> 852 if decision["has_ocr_text"]:  # prevent MD styling if already OCR'd
     853     page_full_ocred = True
     855 if decision["should_ocr"]:
     856     # We should be OCR: check full-page vs. text-only

KeyError: 'has_ocr_text'

This happens because, when document.use_ocr is False, parse_document bypasses the check_ocr.should_ocr_page() call, which would normally populate the decision dictionary with this key, and instead builds its own sparse dictionary that lacks it. See the annotated snippet below.

def parse_document(
        ...
        if document.use_ocr:
            decision = check_ocr.should_ocr_page(
                page,
                dpi=ocr_dpi,
                edge_thresh=0.015,
                blocks=blocks,
            )
        else:
            decision = {"should_ocr": False}  # <- dict created here has no "has_ocr_text" key

        if decision["has_ocr_text"]:  # <- which raises KeyError here
            page_full_ocred = True
        ...

I suggest either (1) adding this key to the dict when bypassing the method (quick fix), or (2) refactoring decision into a dataclass or similar, so that required fields are always present with appropriate defaults.
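
A rough sketch of both options, for illustration only. The decide() function and the OcrDecision class are hypothetical stand-ins, not names from pymupdf4llm, and only the two keys visible in the traceback are modelled; the real dictionary returned by check_ocr.should_ocr_page() may well carry more.

```python
from dataclasses import dataclass


def decide(use_ocr: bool) -> dict:
    """Option 1 (quick fix): hypothetical stand-in for the branch in parse_document."""
    if use_ocr:
        # Placeholder for check_ocr.should_ocr_page(...); its real values are not reproduced.
        return {"should_ocr": True, "has_ocr_text": False}
    # The bypass branch now carries the key that is read later on.
    return {"should_ocr": False, "has_ocr_text": False}


@dataclass
class OcrDecision:
    """Option 2: hypothetical container whose fields always exist with defaults."""
    should_ocr: bool = False
    has_ocr_text: bool = False


# Option 1: the lookup that raised KeyError now succeeds when OCR is disabled.
decision = decide(use_ocr=False)
page_full_ocred = bool(decision["has_ocr_text"])

# Option 2: attribute access cannot miss a field, whichever branch built the object.
default_decision = OcrDecision()
page_full_ocred_dc = default_decision.has_ocr_text
```

With the dataclass, a missing field becomes impossible by construction, whereas the dict version relies on every branch remembering to set every key.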

Thanks!

C.

Jamie_Lemon · November 25, 2025, 12:46am

Thanks @Callum_Davidson - I looked at the PR and I think it probably makes sense. I'll leave it to @HaraldLieder, as he understands the codebase in depth. Thanks for the contribution!

HaraldLieder · November 30, 2025, 8:56am

This has been fixed in the meantime: version 0.2.4.
