To_markdown only producing header tags (and no text), to_json produces correct text from spans

Official Copy (Register) - NK92733-extract.pdf (441.5 KB)

I am trying to extract markdown text from this document but it only produces a set of markdown headers with no text. When I try to_json the spans in the document are correctly recognised. I would expect to_markdown to recognise these spans and use them.

Issue looks similar to this https://forum.mupdf.com/t/problem-with-pymupdf4llm-to-markdown/326

Code to reproduce:

import sys
import pymupdf.layout
import pymupdf4llm
import pymupdf


def main(filepath):
    markdown_text = ""
    json_text = ""
    try:
        with pymupdf.open(filepath) as pdf:  
            markdown_text = pymupdf4llm.to_markdown(
                pdf,
                use_ocr=2,
            )
            json_text = pymupdf4llm.to_json(
                pdf,
                use_ocr=2,  
            )

    except Exception as e:
        print(f"Error processing PDF with PyMuPDF: {e}")

    print(markdown_text)
    print(json_text)

    try:
        with open("D:\\database\\raw_pymupdf.md", 'w', encoding='utf-8') as f:
            f.write(markdown_text)
        with open("D:\\database\\raw_pymupdf.json", 'w', encoding='utf-8') as f:
            f.write(json_text)
    except Exception as e:
        print(f"Error saving content data for raw pdf: {e}")

if __name__ == "__main__":
    # get command line argument for file content
    print(f"{pymupdf.version=}, {pymupdf4llm.version=}")
    if len(sys.argv) < 2:
        print("Usage: python ingestion_helpers.py <filepath>")
        sys.exit(1)
    filepath = sys.argv[1]
    main(filepath)

pymupdf.version=('1.27.2.2', '1.27.2', None), pymupdf4llm.version='1.27.2.2'

Any ideas?

BBUK

I should say that straight pymupdf extracts the span text fine (albeit unformatted), the issue only seems to be with pymupdf4llm.to_markdown().

Hmmm, interesting. I do get results using 1.27.2.2, albeit the Markdown seems a bit wrong:

=== Document parser messages ===
Using RapidOCR and Tesseract for OCR processing.
OCR on page.number=0/1.
OCR on page.number=1/2.
OCR on page.number=2/3.

The electronic official `The electronic official copy of the register follows this message.` copy of the registerPlease note that this is the only official copy we | `Please note that this is the only official copy we will issue. We will not issue a` > will issue.| We will not issueal paper officialfollows `paper official copy.` this message. copy. 

## HM Land Registry HM Land Registry 

## **Official copy** Official copy **of register of** of register of title **title** 

(There is more than that, but this output looks strange to me.)

If I do use_ocr=True then I get much more reliable results:

The electronic official copy of the registerPlease note that this is the only official copy we |> will issue.| We will not issueal paper officialfollows this message. copy. 

HM Land Registry 

## Official copy of register of title 

## Title number NK92733 

## Edition date 04.12.2000 

- This official copy shows the entries on the register of title on 31 AUG 2021 at 10:22:51. 

- This date must be quoted as the "search from date" in any official search application based on this copy. 

- The date at the beginning of an entry is the date on which the entry was made in the register ~~.~~ 

- Issued on 31 Aug 2021, 


Can you confirm you have the required OCR engines installed correctly?

Great, thanks.

I don’t have the OCR engines installed and I don’t really want to use OCR in the sample I provided - the correct span text and bbox info is already there! The longer story is that I am using embed_images=True and where there are any images in the pdfs I am processing, I use a specialised LLM-based image classifier. What I would expect is that span text would be extracted to markdown and any real images (here land plans, but not in the provided sample) would then be separately processed.

Do you know if this is achievable?

Thanks again

BBUK

What happens if you use the previous version of pymupdf4llm? pip install pymupdf4llm==1.27.2.1 and then just:

markdown_text = pymupdf4llm.to_markdown(
                pdf
            )

For me, that produces the Markdown without any need for OCR.

Yup, that pretty much works 100%. Thanks - possibly a regression. The only (possibly unrelated) thing I see is that embed_images does not work as expected for one of my test files:

Kibblesworth Title Plan TY301184.PDF (297.8 KB)

For this file, I would expect the markdown to be extracted with an embedded base64-encoded image representing the last page. The markdown text is fine, but I see no embedded image. embed_images does, however, work as expected on other files whose pages mix text and images.

If helpful, I would add that with that file I do get the embedded base64 image as expected if I set pymupdf4llm.use_layout(False)!?!

The thing is that breaks the markdown generation on other files such as the one I first posted.

@BBUK Yes - I see what you mean - I will log a bug with this with the internal team - seems strange that it doesn’t work as expected.

Thank you.

Will keep you updated when I find out more!

Hi Jamie_Lemon, I have found the root cause of this issue (or rather of a series of issues that includes this one).

The root cause is: layout mode drops the invisible text layer (alpha=0) unless patched, causing major text loss in markdown and json boxes.

Summary (By AI)

In pymupdf4llm layout mode (e.g. _layout_to_markdown), transparent text spans (alpha=0) are filtered out in the layout extraction path. This causes severe text loss in to_markdown output and in to_json pages.boxes.textlines, even when callers need invisible OCR/text layers preserved.
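To check whether a given PDF is affected, it can help to count how much of its text sits in fully transparent spans. The following is only a rough sketch: `count_chars_by_visibility` is a hypothetical helper, and it assumes the span dictionaries returned by `page.get_text("dict")` carry an integer `alpha` key where 0 means invisible (spans without the key are treated as visible).

```python
from collections import Counter


def count_chars_by_visibility(page_dict):
    """Count span characters, split by whether the span is fully transparent.

    `page_dict` is expected to look like the result of page.get_text("dict").
    ASSUMPTION: each span carries an integer "alpha" key (0 = invisible);
    spans without that key are counted as visible.
    """
    counts = Counter(visible=0, invisible=0)
    for block in page_dict.get("blocks", []):
        # Image blocks have no "lines" entry, so they contribute nothing.
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                key = "invisible" if span.get("alpha", 255) == 0 else "visible"
                counts[key] += len(span.get("text", ""))
    return dict(counts)
```

With PyMuPDF installed, the helper could be fed `page.get_text("dict")` for each page of an opened document. A page whose characters are almost all counted as invisible would be a likely candidate for the text loss described here.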

Environment

  • OS: Windows

  • PyMuPDF: 1.27.2.3

  • pymupdf4llm: 1.27.2.3

  • Layout mode: enabled

  • OCR for repro: use_ocr=False, force_ocr=False

Repro

  1. Enable layout mode (after adding ignore_alpha support to the layout wrappers).

  2. Run to_markdown and to_json twice on a PDF with transparent text layer:

  • ignore_alpha=False

  • ignore_alpha=True

  3. Compare markdown length and pages.boxes.textlines span character count.
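The "json boxes chars" metric from the comparison step could be computed along these lines. This is a sketch under assumptions: the exact to_json schema is not shown in this thread, so the helper simply walks the parsed JSON and sums every string stored under a `text` key anywhere beneath a `textlines` key. The names `count_textline_chars` and `json_box_chars` are hypothetical.

```python
import json


def _text_len(item):
    """Sum the lengths of all string "text" values inside item."""
    if isinstance(item, dict):
        own = len(item["text"]) if isinstance(item.get("text"), str) else 0
        return own + sum(_text_len(v) for k, v in item.items() if k != "text")
    if isinstance(item, list):
        return sum(_text_len(v) for v in item)
    return 0


def count_textline_chars(node):
    """Count characters found beneath any "textlines" key (assumed schema)."""
    if isinstance(node, dict):
        return sum(
            _text_len(value) if key == "textlines" else count_textline_chars(value)
            for key, value in node.items()
        )
    if isinstance(node, list):
        return sum(count_textline_chars(v) for v in node)
    return 0


def json_box_chars(json_text):
    """Convenience wrapper for the JSON string returned by to_json."""
    return count_textline_chars(json.loads(json_text))
```

Running `len(markdown_text)` and `json_box_chars(json_text)` for both settings of ignore_alpha gives the two numbers to compare per file.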

Observed

  • With ignore_alpha=False (the default), layout outputs for some files are nearly empty.

  • to_json fulltext may look less affected, but boxes.textlines are often empty.

  • This creates a confusing mismatch: fulltext seems present while markdown and box-based outputs are missing.

Measured examples

  • NK92733:

    • markdown length: 93 → 3123

    • json boxes chars: 0 → 2701

  • gupta1985.pdf (1.0 MB):

    • markdown length: 4036 → 50991

    • json boxes chars: 0 → 45204

  • chiang1982.pdf (574.3 KB):

    • markdown length: 2523 → 2616

Expected
When ignore_alpha=True, the layout path should preserve invisible spans consistently, with the same policy for markdown and json boxes/textlines.

Likely root cause
Layout get_raw_lines path uses invisible-span filtering and does not consistently propagate ignore_alpha semantics through layout extraction stages.

Request
Please add official ignore_alpha support in layout path and add regression tests for transparent text-layer PDFs.

Optional separate issue
There is also an unrelated page_chunks path crash when doc.toc is None and code iterates over it.
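A guard of the following kind would avoid iterating over a missing table of contents. It is a minimal sketch only: `iter_toc_entries` is a hypothetical helper, not the library's code, and the actual crash site is not shown in this thread.

```python
def iter_toc_entries(toc):
    """Yield TOC entries, treating a missing TOC (None) as empty."""
    yield from (toc or [])
```

A loop written as `for entry in iter_toc_entries(toc):` then simply does nothing when no TOC is available.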

Minimal Diff Summary For Maintainers

Patch intent
Propagate ignore_alpha through layout wrappers into parse_document and use it to control ignore_invisible in layout line extraction.

Changes

  • Add ignore_alpha parameter to layout wrappers:

    • _layout_to_markdown

    • _layout_to_json

    • _layout_to_text

  • Forward ignore_alpha into parse_document.

  • In parse_document:

    • add ignore_alpha argument

    • store document.ignore_alpha

  • In all relevant get_raw_lines calls in layout parse flow, use:
    ignore_invisible = not (pagelayout.full_ocred or document.ignore_alpha),

Validation outcome

  • Large recovery of extracted text in NK92733 and gupta1985 when ignore_alpha=True.

  • Confirms invisible-layer filtering in layout path is the main cause.

Here is the latest version of the patch that fixes the issues on these PDFs (including chiang1982.pdf):

Patch Note: Unifying ignore_alpha Semantics Across Legacy/Layout and Restoring Layout Sensitivity

1. Scope and Objective

This patch package documents local fixes applied in a PyMuPDF4LLM environment to solve inconsistent handling of invisible text (alpha = 0) between:

  • legacy extraction path

  • layout extraction path

and to address a layout pipeline behavior that made ignore_alpha appear ineffective on some PDFs.

Goal:

  • ignore_alpha = False: keep invisible text

  • ignore_alpha = True: ignore invisible text

for both legacy and layout flows.

2. Environment

Observed in local site-packages under:

  • .venv/Lib/site-packages/pymupdf4llm/...

  • .venv/Lib/site-packages/pymupdf/layout/...

Representative package version used during validation: pymupdf4llm 1.27.2.3.

3. Root Cause Summary

Two independent issues were involved:

  1. Semantic mismatch risk across code paths:
  • legacy path and layout path needed explicit alignment to the same ignore_alpha meaning.
  2. Upstream RF filtering side effect in layout stack:
  • RF feature extraction removed invisible samples before downstream text extraction.

  • As a result, toggling ignore_alpha could become non-responsive in layout mode on affected pages/documents.

4. Patch Set

4.1 Layout wrappers: pass ignore_alpha through to parser

File:

  • .venv/Lib/site-packages/pymupdf4llm/__init__.py

Key points:

  • _layout_to_markdown(...) accepts ignore_alpha and forwards it to parse_document(...).

  • _layout_to_json(...) accepts ignore_alpha and forwards it to parse_document(...).

  • _layout_to_text(...) accepts ignore_alpha and forwards it to parse_document(...).

Representative lines:

  • ignore_alpha=False in layout wrapper signatures

  • ignore_alpha=ignore_alpha in parse_document(...) calls

4.2 Layout parser: apply visibility switch in raw-line extraction

File:

  • .venv/Lib/site-packages/pymupdf4llm/helpers/document_layout.py

Changes:

  • parse_document(...) includes parameter ignore_alpha=False

  • document.ignore_alpha = ignore_alpha

  • get_raw_lines(...) calls for picture-force-text / table-fallback / text-like boxes now use:


ignore_invisible=(not pagelayout.full_ocred) and document.ignore_alpha

This makes layout extraction obey the same operational semantics for invisible text handling.

4.3 Legacy path: normalize accept_invisible logic

File:

  • .venv/Lib/site-packages/pymupdf4llm/helpers/pymupdf_rag.py

Applied logic:


parms.accept_invisible = (

page_is_ocr(page) or (not ignore_alpha)

)

Interpretation:

  • OCR pages keep invisible acceptance behavior as designed.

  • Non-OCR pages map directly to the unified meaning:

  • ignore_alpha=False → accept invisible

  • ignore_alpha=True → do not accept invisible
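The two expressions quoted in 4.2 and 4.3 can be checked against each other with a small truth table. This is a sketch, not the patched code itself: the function names are illustrative, and it assumes that pagelayout.full_ocred and page_is_ocr(page) flag the same pages.

```python
from itertools import product


def layout_ignore_invisible(full_ocred, ignore_alpha):
    # Expression quoted in section 4.2 for the layout path.
    return (not full_ocred) and ignore_alpha


def legacy_accept_invisible(page_is_ocr, ignore_alpha):
    # Expression quoted in section 4.3 for the legacy path.
    return page_is_ocr or (not ignore_alpha)


def semantics_agree():
    """True if legacy "accept" is always the negation of layout "ignore".

    ASSUMPTION: both paths see the same OCR flag for a given page.
    """
    return all(
        legacy_accept_invisible(ocr, alpha) == (not layout_ignore_invisible(ocr, alpha))
        for ocr, alpha in product([False, True], repeat=2)
    )


# Print the full truth table for a quick visual check.
for ocr, alpha in product([False, True], repeat=2):
    print(
        f"ocr={ocr!s:5} ignore_alpha={alpha!s:5} "
        f"ignore_invisible={layout_ignore_invisible(ocr, alpha)!s:5} "
        f"accept_invisible={legacy_accept_invisible(ocr, alpha)}"
    )
```

Under that assumption the two expressions are exact negations of each other (by De Morgan's law), so invisible text is dropped only on non-OCR pages when ignore_alpha=True.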

4.4 RF stage: stop deleting invisible samples globally

File:

  • .venv/Lib/site-packages/pymupdf/layout/pymupdf_util_rf.py

Change intent:

  • Removed early deletion of invisible rows in RF extraction (after Line 23).

  • Keep invisible samples available so downstream policy (ignore_alpha) can decide.

Current function keeps a note:


# Keep invisible samples so downstream extraction can still honor

# caller-side visibility policy (e.g., ignore_alpha switches).

5. Unified Semantics (Final)

For both legacy and layout extraction:

  • ignore_alpha=False → include transparent/invisible text

  • ignore_alpha=True → filter transparent/invisible text

6. Why to_json Could Look More Complete Than Markdown

Even before full fix, to_json could appear to contain more text because data surfaces differ by stage:

  • JSON may include broader/fulltext-oriented information

  • Markdown/text rendering can depend on selected layout boxes/line extraction

If upstream layout candidates are reduced (for example by RF filtering), markdown can lose content even when some text exists elsewhere in intermediate structures.

7. Validation Checklist

Recommended validation corpus (used in this debugging thread):

  • base/Official Copy (Register) - NK92733-extract.pdf

  • base/gupta1985.pdf

  • base/chiang1982.pdf

Minimum checks:

  1. legacy mode:
  • compare text length and visible content under ignore_alpha=False/True
  2. layout mode:
  • compare text length and visible content under ignore_alpha=False/True
  3. consistency:
  • verify direction is the same in both modes (False keeps more, True filters more)
  4. regression sanity:
  • ensure normal visible text extraction quality is not degraded.

8. Repro/Verification Script Tips

When reporting to another team, include:

  • exact package versions

  • exact patched files and line snippets

  • one-page targeted case (for example chiang1982 page 2)

  • full-document cross-check on multiple PDFs

  • output metric: extracted text length + spot-check key paragraphs
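The output metric in the last item could be produced by something like the sketch below. `extraction_report` is a hypothetical helper; it compares phrases case-insensitively and ignores whitespace differences, which is one possible matching policy and not a requirement from this thread.

```python
def extraction_report(text, key_phrases):
    """Return the text length plus which key phrases are present.

    Matching is case-insensitive and collapses runs of whitespace, so line
    breaks inside a phrase do not cause a false negative.
    """
    normalized = " ".join(text.split()).lower()
    found = {
        phrase: " ".join(phrase.split()).lower() in normalized
        for phrase in key_phrases
    }
    return {"length": len(text), "found": found}
```

For the NK92733 sample, phrases such as "Title number NK92733" and "Edition date 04.12.2000" would be natural spot checks, since both appear in the output quoted earlier in this thread.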

9. Operational Notes

  • These are local site-packages patches.

  • For durable sharing, convert this into an upstream PR or an internal fork patch set.

  • Keep patch minimal: avoid changing unrelated parsing behavior.

10. Quick Handoff Summary

Implemented local fixes that:

  • propagate ignore_alpha through layout entrypoints,

  • apply visibility filtering policy at layout raw-line extraction,

  • align legacy accept_invisible to the same semantic direction,

  • remove RF-stage hard deletion of invisible samples so layout mode can respond to ignore_alpha switches correctly.

Net effect: legacy/layout now follow one consistent user-facing meaning for ignore_alpha, and layout sensitivity is restored on previously problematic documents.

@flymachine Please don’t spam here with so much AI generated analysis! :slight_smile: Trying to read through your posts to understand it better right now.