BUG: pymupdf4llm parse_document() produces KeyError when document.use_ocr is False

Hi,
I raised this as an issue on the GitHub page before I saw the pinned notice to post here. Apologies. Below is what I posted there:

Parsing a document results in a KeyError when OCR is disabled. See the traceback below, from the parse_document function in helpers/document_layout.py.

     849 else:
     850     decision = {"should_ocr": False}
--> 852 if decision["has_ocr_text"]:  # prevent MD styling if already OCR'd
     853     page_full_ocred = True
     855 if decision["should_ocr"]:
     856     # We should be OCR: check full-page vs. text-only

KeyError: 'has_ocr_text'

This happens because, when document.use_ocr is False, the code bypasses the check_ocr.should_ocr_page() method, which would normally populate the decision dictionary with this key, and instead builds its own sparse dictionary that lacks it. See below for an annotated snippet.

def parse_document(...):
        ...
        if document.use_ocr:
            decision = check_ocr.should_ocr_page(
                page,
                dpi=ocr_dpi,
                edge_thresh=0.015,
                blocks=blocks,
            )
        else:
            decision = {"should_ocr": False}  # <- dict created here has no "has_ocr_text" key

        if decision["has_ocr_text"]:  # <- which raises the KeyError here
            page_full_ocred = True
        ...
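
The failure mode can be reproduced in isolation with a minimal sketch. Note that build_decision and page_is_full_ocred are hypothetical stand-ins for the branch above, not pymupdf4llm code, and the dict returned in the OCR branch is only an assumed shape for illustration.

```python
def build_decision(use_ocr: bool) -> dict:
    """Hypothetical stand-in for the use_ocr branch in parse_document."""
    if use_ocr:
        # Assumed shape of what check_ocr.should_ocr_page() would return.
        return {"should_ocr": True, "has_ocr_text": False}
    # Sparse dict from the bypass branch: "has_ocr_text" is missing.
    return {"should_ocr": False}


def page_is_full_ocred(decision: dict) -> bool:
    # Mirrors the failing lookup: indexing a missing key raises KeyError.
    return bool(decision["has_ocr_text"])


missing_key = None
try:
    page_is_full_ocred(build_decision(use_ocr=False))
except KeyError as exc:
    # KeyError stores the missing key as its first argument.
    missing_key = exc.args[0]
    print(f"KeyError: {missing_key!r}")
```

With use_ocr=True the same lookup succeeds, which is why the error only shows up when OCR is disabled.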

I suggest either (1) adding this field to the dict when bypassing this method (quick fix), or (2) refactoring to use a dataclass or similar for decision, so that required fields are always present with appropriate defaults.
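
A rough sketch of both options, for illustration only. OcrDecision is a hypothetical name, its two fields are taken from the keys visible in the traceback, and the False defaults are my assumption. The real return value of should_ocr_page() may carry more fields than these.

```python
from dataclasses import dataclass


@dataclass
class OcrDecision:
    """Hypothetical container; field names come from the keys seen above."""
    should_ocr: bool = False    # assumed default when OCR is disabled
    has_ocr_text: bool = False  # assumed default when OCR is disabled


# Option 1 (quick fix): include the missing key in the bypass dict.
decision_dict = {"should_ocr": False, "has_ocr_text": False}

# Option 2: a dataclass guarantees every field exists with a default.
decision = OcrDecision()

page_full_ocred = False
if decision.has_ocr_text:  # attribute always exists, so no KeyError
    page_full_ocred = True
```

Option 2 also moves the list of expected fields into one place, so a new field cannot be forgotten in just one branch.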

Thanks!

C.

Thanks @Callum_Davidson - I looked at the PR and I think it probably makes sense. I'll leave it to @HaraldLieder, as he understands the codebase in depth. Thanks for the contribution!

This has since been fixed in version 0.2.4.