The latest version of the patch that fixes the issues on these PDFs (including chiang1982.pdf) is as follows:
Patch Note: Unifying ignore_alpha Semantics Across Legacy/Layout and Restoring Layout Sensitivity
1. Scope and Objective
This patch package documents local fixes applied in a PyMuPDF4LLM environment to solve inconsistent handling of invisible text (alpha = 0) between:
- legacy extraction path
- layout extraction path
and to address a layout pipeline behavior that made ignore_alpha appear ineffective on some PDFs.
Goal: one consistent user-facing meaning of ignore_alpha for both legacy and layout flows.
2. Environment
Observed in local site-packages under:
Representative package version used during validation: pymupdf4llm 1.27.2.3.
3. Root Cause Summary
Two independent issues were involved:
- Semantic mismatch risk across code paths:
  - The legacy path and the layout path needed explicit alignment to the same ignore_alpha meaning.
- Upstream RF filtering side effect in the layout stack:
  - RF feature extraction removed invisible samples before downstream text extraction.
  - As a result, toggling ignore_alpha could become non-responsive in layout mode on affected pages/documents.
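The second issue can be illustrated with a toy model. This is not the actual RF code: `Span`, `rf_stage`, `extract`, and `run` are hypothetical stand-ins that only mimic the order of operations described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    text: str
    alpha: int  # 0 means invisible


def rf_stage(spans, drop_invisible):
    """Hypothetical upstream stage; optionally deletes alpha == 0 samples."""
    if drop_invisible:
        return [s for s in spans if s.alpha != 0]
    return list(spans)


def extract(spans, ignore_alpha):
    """Hypothetical downstream extraction that honors the caller's switch."""
    kept = [s for s in spans if not (ignore_alpha and s.alpha == 0)]
    return " ".join(s.text for s in kept)


def run(spans, ignore_alpha, drop_invisible):
    return extract(rf_stage(spans, drop_invisible), ignore_alpha)


page = [Span("visible", 255), Span("hidden", 0)]

# Upstream deletion: both settings give the same output, so the switch looks dead.
before_fix = {flag: run(page, flag, drop_invisible=True) for flag in (False, True)}

# Upstream keeps everything: the switch changes the output again.
after_fix = {flag: run(page, flag, drop_invisible=False) for flag in (False, True)}
```

In the toy model, `before_fix` holds the same string for both flag values, while `after_fix` differs between them, which is the symptom and the intended behavior respectively.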
4. Patch Set
4.1 Layout wrappers: pass ignore_alpha through to parser
File:
.venv/Lib/site-packages/pymupdf4llm/__init__.py
Key points:
- _layout_to_markdown(...) accepts ignore_alpha and forwards it to parse_document(...).
- _layout_to_json(...) accepts ignore_alpha and forwards it to parse_document(...).
- _layout_to_text(...) accepts ignore_alpha and forwards it to parse_document(...).
Representative lines:
4.2 Layout parser: apply visibility switch in raw-line extraction
File:
.venv/Lib/site-packages/pymupdf4llm/helpers/document_layout.py
Changes:
- parse_document(...) includes parameter ignore_alpha=False
- document.ignore_alpha = ignore_alpha
- get_raw_lines(...) calls for picture-force-text / table-fallback / text-like boxes now use:
  ignore_invisible=(not pagelayout.full_ocred) and document.ignore_alpha
This makes layout extraction obey the same operational semantics for invisible text handling.
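To make the expression concrete, the sketch below enumerates its four input combinations using plain booleans. The function name `layout_ignore_invisible` is hypothetical; only the boolean expression is taken from the change above.

```python
from itertools import product


def layout_ignore_invisible(full_ocred: bool, ignore_alpha: bool) -> bool:
    """Mirror of the ignore_invisible expression, using plain booleans."""
    return (not full_ocred) and ignore_alpha


table = {
    (full_ocred, ignore_alpha): layout_ignore_invisible(full_ocred, ignore_alpha)
    for full_ocred, ignore_alpha in product((False, True), repeat=2)
}

for (full_ocred, ignore_alpha), result in table.items():
    print(
        f"full_ocred={full_ocred!s:5} ignore_alpha={ignore_alpha!s:5} "
        f"-> ignore_invisible={result}"
    )
```

Only one combination filters invisible text: a page that is not fully OCRed, with ignore_alpha=True.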
4.3 Legacy path: normalize accept_invisible logic
File:
.venv/Lib/site-packages/pymupdf4llm/helpers/pymupdf_rag.py
Applied logic:
parms.accept_invisible = (
    page_is_ocr(page) or (not ignore_alpha)
)
Interpretation:
- OCR pages keep invisible acceptance behavior as designed.
- Non-OCR pages map directly to the unified meaning:
  - ignore_alpha=False → accept invisible
  - ignore_alpha=True → do not accept invisible
4.4 RF stage: stop deleting invisible samples globally
File:
.venv/Lib/site-packages/pymupdf/layout/pymupdf_util_rf.py
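Change intent: stop the RF stage from deleting invisible samples, so that the visibility decision is made downstream.
The current function keeps a note: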
# Keep invisible samples so downstream extraction can still honor
# caller-side visibility policy (e.g., ignore_alpha switches).
5. Unified Semantics (Final)
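For both legacy and layout extraction (non-OCR pages):
- ignore_alpha=False → accept invisible text
- ignore_alpha=True → do not accept invisible text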
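As a sanity check, the expressions from 4.2 and 4.3 can be compared exhaustively. The sketch assumes that `page_is_ocr(page)` on the legacy side and `pagelayout.full_ocred` on the layout side agree for a given page, which this note does not confirm. Under that assumption, De Morgan's law makes the legacy accept_invisible the exact negation of the layout ignore_invisible. Both function names are hypothetical.

```python
from itertools import product


def legacy_accept_invisible(is_ocr: bool, ignore_alpha: bool) -> bool:
    # Expression from the legacy path (section 4.3).
    return is_ocr or (not ignore_alpha)


def layout_ignore_invisible(is_ocr: bool, ignore_alpha: bool) -> bool:
    # Expression from the layout path (section 4.2).
    return (not is_ocr) and ignore_alpha


# The two paths agree when "accept" is always the opposite of "ignore".
mismatches = [
    (is_ocr, ignore_alpha)
    for is_ocr, ignore_alpha in product((False, True), repeat=2)
    if legacy_accept_invisible(is_ocr, ignore_alpha)
    == layout_ignore_invisible(is_ocr, ignore_alpha)
]
```

An empty `mismatches` list means the two code paths make the same decision for every combination of OCR status and ignore_alpha.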
6. Why to_json Could Look More Complete Than Markdown
Even before full fix, to_json could appear to contain more text because data surfaces differ by stage:
If upstream layout candidates are reduced (for example by RF filtering), markdown can lose content even when some text exists elsewhere in intermediate structures.
7. Validation Checklist
Recommended validation corpus (used in this debugging thread):
Minimum checks:
- legacy mode:
  - compare text length and visible content under ignore_alpha=False/True
- layout mode:
  - compare text length and visible content under ignore_alpha=False/True
- consistency:
  - verify direction is the same in both modes (False keeps more, True filters more)
- regression sanity:
  - ensure normal visible text extraction quality is not degraded.
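The length and direction checks above can be sketched as a small helper. It takes the extractor as a callable so that it does not depend on any particular API; `compare_alpha` and `fake_extract` are hypothetical names, and the fake extractor exists only so the helper can be exercised without a PDF.

```python
from typing import Callable, Dict


def compare_alpha(extract: Callable[[str, bool], str], path: str) -> Dict[str, object]:
    """Run one extractor with ignore_alpha False and True and compare lengths."""
    keep = extract(path, False)
    drop = extract(path, True)
    return {
        "len_false": len(keep),
        "len_true": len(drop),
        # The switch is responsive if the two outputs differ at all.
        "responsive": keep != drop,
        # Expected direction: False keeps at least as much text as True.
        "direction_ok": len(keep) >= len(drop),
    }


def fake_extract(path: str, ignore_alpha: bool) -> str:
    """Stand-in extractor; a real one would call the extraction library."""
    return "visible" if ignore_alpha else "visible hidden"


report = compare_alpha(fake_extract, "example.pdf")
```

In practice `extract` would wrap the real call, for example `lambda p, flag: pymupdf4llm.to_markdown(p, ignore_alpha=flag)`, run once per mode. How legacy versus layout mode is selected depends on the installation and is not shown here. Length alone does not cover the regression check, so spot-checking visible paragraphs is still needed.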
8. Repro/Verification Script Tips
When reporting to another team, include:
- exact package versions
- exact patched files and line snippets
- one-page targeted case (for example chiang1982 page 2)
- full-document cross-check on multiple PDFs
- output metric: extracted text length + spot-check key paragraphs
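For the "exact package versions" item, one option is to read them from the installed distribution metadata with the standard library. The distribution names passed in below are assumptions; adjust them to whatever is installed in the environment being reported.

```python
from importlib import metadata


def collect_versions(packages):
    """Return {package: version or None} for the given distribution names."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return versions


# Assumed distribution names; a missing package is reported as None.
report = collect_versions(["pymupdf4llm", "PyMuPDF"])
for name, version in report.items():
    print(f"{name}=={version}")
```

Reporting None for a missing package, rather than raising, keeps the output usable when the script is run in an environment that differs from the patched one.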
9. Operational Notes
- These are local site-packages patches.
- For durable sharing, convert this into an upstream PR or an internal fork patch set.
- Keep the patch minimal: avoid changing unrelated parsing behavior.
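One way to turn a local edit into a shareable patch is a unified diff between the pristine and the patched file text. The sketch below uses the standard library's difflib; `make_patch` is a hypothetical helper, and the two strings are toy content, not the real file contents.

```python
import difflib


def make_patch(original: str, patched: str, filename: str) -> str:
    """Return a unified diff between two versions of one file's text."""
    diff = difflib.unified_diff(
        original.splitlines(keepends=True),
        patched.splitlines(keepends=True),
        fromfile=f"a/{filename}",
        tofile=f"b/{filename}",
    )
    return "".join(diff)


# Toy before/after content for illustration only.
old = "def parse_document(doc):\n    return doc\n"
new = "def parse_document(doc, ignore_alpha=False):\n    return doc\n"
patch = make_patch(old, new, "helpers/document_layout.py")
print(patch)
```

A small diff per file also makes it easy to confirm that the patch stays minimal and touches no unrelated parsing code.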
10. Quick Handoff Summary
Implemented local fixes that:
- propagate ignore_alpha through layout entrypoints,
- apply visibility filtering policy at layout raw-line extraction,
- align legacy accept_invisible to the same semantic direction,
- remove RF-stage hard deletion of invisible samples so layout mode can respond to ignore_alpha switches correctly.
Net effect: legacy/layout now follow one consistent user-facing meaning for ignore_alpha, and layout sensitivity is restored on previously problematic documents.