To_markdown only producing header tags (and no text), to_json produces correct text from spans

BBUK · April 14, 2026, 8:24pm

Official Copy (Register) - NK92733-extract.pdf (441.5 KB)

I am trying to extract markdown text from this document but it only produces a set of markdown headers with no text. When I try to_json the spans in the document are correctly recognised. I would expect to_markdown to recognise these spans and use them.

Issue looks similar to this https://forum.mupdf.com/t/problem-with-pymupdf4llm-to-markdown/326

Code to reproduce:

import sys
import pymupdf.layout
import pymupdf4llm
import pymupdf


def main(filepath):
    markdown_text = ""
    json_text = ""
    try:
        with pymupdf.open(filepath) as pdf:  
            markdown_text = pymupdf4llm.to_markdown(
                pdf,
                use_ocr=2,
            )
            json_text = pymupdf4llm.to_json(
                pdf,
                use_ocr=2,  
            )

    except Exception as e:
        print(f"Error processing PDF with PyMuPDF: {e}")

    print(markdown_text)
    print(json_text)

    try:
        with open(f"D:\\database\\raw_pymupdf.md", 'w', encoding='utf-8') as f:
            f.write(markdown_text)
        with open(f"D:\\database\\raw_pymupdf.json", 'w', encoding='utf-8') as f:
            f.write(json_text)
    except Exception as e:
        print(f"Error saving content data for raw pdf: {e}")

if __name__ == "__main__":
    # get command line argument for file content
    print(f"{pymupdf.version=}, {pymupdf4llm.version=}")
    if len(sys.argv) < 2:
        print("Usage: python ingestion_helpers.py <filepath>")
        sys.exit(1)
    filepath = sys.argv[1]
    main(filepath)

pymupdf.version=('1.27.2.2', '1.27.2', None), pymupdf4llm.version='1.27.2.2'

Any ideas?

BBUK

BBUK · April 14, 2026, 9:24pm

I should say that straight pymupdf extracts the span text fine (albeit unformatted), the issue only seems to be with pymupdf4llm.to_markdown().
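
For reference, here is a minimal sketch of what span extraction with plain PyMuPDF can look like. It walks a dictionary in the shape returned by page.get_text("dict") (blocks, then lines, then spans). The sample dictionary is hand-built for illustration and is not taken from the attached PDF. The "alpha" span key is an assumption about recent PyMuPDF versions, so the code treats it as optional. `iter_spans` and `span_text` are hypothetical helper names.

```python
def iter_spans(page_dict):
    """Yield every text span from a page.get_text("dict")-style dictionary."""
    for block in page_dict.get("blocks", []):
        # Image blocks carry no "lines" key, so they contribute no spans.
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                yield span


def span_text(page_dict, include_invisible=True):
    """Join span texts; optionally drop spans whose alpha is 0."""
    parts = []
    for span in iter_spans(page_dict):
        # ASSUMPTION: "alpha" may be absent; a missing key counts as opaque.
        if not include_invisible and span.get("alpha", 255) == 0:
            continue
        parts.append(span["text"])
    return " ".join(parts)


# Hand-built sample in the same shape (illustrative only).
sample = {
    "blocks": [
        {"type": 0, "lines": [{"spans": [
            {"text": "Title number", "bbox": (10, 10, 80, 20), "alpha": 255},
            {"text": "NK92733", "bbox": (85, 10, 130, 20), "alpha": 0},
        ]}]},
        {"type": 1},  # image block: no lines
    ]
}

print(span_text(sample))
print(span_text(sample, include_invisible=False))
```

With PyMuPDF installed, the real input would come from something like pymupdf.open(filepath)[0].get_text("dict").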

Jamie_Lemon · April 14, 2026, 10:05pm

Hmmm, interesting. I do get results using 1.27.2.2, albeit the Markdown seems a bit wrong:

=== Document parser messages ===
Using RapidOCR and Tesseract for OCR processing.
OCR on page.number=0/1.
OCR on page.number=1/2.
OCR on page.number=2/3.

The electronic official `The electronic official copy of the register follows this message.` copy of the registerPlease note that this is the only official copy we | `Please note that this is the only official copy we will issue. We will not issue a` > will issue.| We will not issueal paper officialfollows `paper official copy.` this message. copy. 

## HM Land Registry HM Land Registry 

## **Official copy** Official copy **of register of** of register of title **title**

(There is more than that, but there seems to be some strangeness with this output to me.)

If I do use_ocr=True then I get much more reliable results:

The electronic official copy of the registerPlease note that this is the only official copy we |> will issue.| We will not issueal paper officialfollows this message. copy. 

HM Land Registry 

## Official copy of register of title 

## Title number NK92733 

## Edition date 04.12.2000 

- This official copy shows the entries on the register of title on 31 AUG 2021 at 10:22:51. 

- This date must be quoted as the "search from date" in any official search application based on this copy. 

- The date at the beginning of an entry is the date on which the entry was made in the register ~~.~~ 

- Issued on 31 Aug 2021,

Can you confirm you have the required OCR engines installed correctly?

BBUK · April 14, 2026, 10:26pm

Great, thanks.

I don’t have the OCR engines installed and I don’t really want to use OCR in the sample I provided - the correct span text and bbox info is already there! The longer story is that I am using embed_images=True and where there are any images in the pdfs I am processing, I use a specialised LLM-based image classifier. What I would expect is that span text would be extracted to markdown and any real images (here land plans, but not in the provided sample) would then be separately processed.

Do you know if this is achievable?

Thanks again

BBUK

Jamie_Lemon · April 14, 2026, 10:53pm

What happens if you use the previous version of pymupdf4llm? pip install pymupdf4llm==1.27.2.1 and then just:

markdown_text = pymupdf4llm.to_markdown(
                pdf
            )

For me I seem to get the MD and without the need for OCRing.

BBUK · April 15, 2026, 12:34am

Yup, that pretty much works 100%. Thanks - possibly a regression. The only (possibly unrelated) thing I see is that embed_images does not work as expected for one of my test files:

Kibblesworth Title Plan TY301184.PDF (297.8 KB)

For this file, I would expect the markdown to be extracted with an embedded base64-encoded image representing the last page. The markdown text is fine but I see no embedded image. embed_images does, however, work as expected on other files with pages that mix text and images.

BBUK · April 15, 2026, 1:06am

If helpful, I would add that with that file I do get the embedded base64 image as expected if I set pymupdf4llm.use_layout(False)!?!

The thing is that breaks the markdown generation on other files such as the one I first posted.

Jamie_Lemon · April 15, 2026, 10:00pm

@BBUK Yes - I see what you mean - I will log a bug for this with the internal team - it seems strange that it doesn’t work as expected.

BBUK · April 15, 2026, 10:06pm

Thank you.

Jamie_Lemon · April 15, 2026, 11:08pm

Will keep you updated when I find out more!

flymachine · May 6, 2026, 7:13am

Hi Jamie_Lemon, I have found the root cause of this issue (or rather of a series of issues that includes this one).

The root cause is: layout mode drops the invisible text layer (alpha=0) unless patched, causing major text loss in markdown and in JSON boxes.

Summary (By AI)

In pymupdf4llm layout mode (e.g. _layout_to_markdown), transparent text spans (alpha=0) are filtered out in the layout extraction path. This causes severe text loss in to_markdown and in to_json pages.boxes.textlines, even when callers need invisible OCR/text layers preserved.

Environment

OS: Windows
PyMuPDF: 1.27.2.3
pymupdf4llm: 1.27.2.3
Layout mode: enabled
OCR for repro: use_ocr=False, force_ocr=False

Repro

Enable layout mode (after adding support for an ignore_alpha argument to the layout wrappers).
Run to_markdown and to_json twice on a PDF with a transparent text layer:

- ignore_alpha=False
- ignore_alpha=True

Compare the markdown length and the pages.boxes.textlines span character count.
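
The comparison in the last step can be sketched as below. The JSON shape (pages, then boxes, then textlines, then spans, then text) is inferred from the field names mentioned in this post and is an assumption, not a documented schema. `count_box_chars` is a hypothetical helper, and the two inputs are hand-built stand-ins for the two runs.

```python
import json


def count_box_chars(json_text):
    """Count characters in span texts under pages[*].boxes[*].textlines[*]."""
    data = json.loads(json_text)
    total = 0
    for page in data.get("pages", []):
        for box in page.get("boxes", []):
            # textlines may be missing or None for non-text boxes.
            for line in box.get("textlines") or []:
                for span in line.get("spans", []):
                    total += len(span.get("text", ""))
    return total


# Hand-built outputs standing in for the two runs (illustrative only).
run_a = json.dumps({"pages": [{"boxes": [{"textlines": []}]}]})
run_b = json.dumps({"pages": [{"boxes": [{"textlines": [
    {"spans": [{"text": "Official copy"}, {"text": " of register"}]}
]}]}]})

print(count_box_chars(run_a), count_box_chars(run_b))
```

Comparing this count next to len(markdown_text) for each run gives the two metrics listed under "Measured examples".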

Observed

With ignore_alpha=False (Default), some files are near empty in layout outputs.
to_json fulltext may look less affected, but boxes.textlines are often empty.
This creates a confusing mismatch: fulltext seems present while markdown and box-based outputs are missing.

Measured examples (ignore_alpha=False → ignore_alpha=True)

NK92733:
- markdown length: 93 → 3123
- json boxes chars: 0 → 2701
gupta1985.pdf (1.0 MB):
- markdown length: 4036 → 50991
- json boxes chars: 0 → 45204
chiang1982.pdf (574.3 KB):
- markdown length: 2523 → 2616

Expected
When ignore_alpha=True, layout path should preserve invisible spans consistently, same policy for markdown and json boxes/textlines.

Likely root cause
Layout get_raw_lines path uses invisible-span filtering and does not consistently propagate ignore_alpha semantics through layout extraction stages.

Request
Please add official ignore_alpha support in layout path and add regression tests for transparent text-layer PDFs.

Optional separate issue
There is also an unrelated page_chunks path crash when doc.toc is None and code iterates over it.
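
The crash described here is the usual "iterate over None" pattern. Below is a minimal sketch of a defensive guard, assuming the table of contents arrives either as a list of [level, title, page] entries or as None. `safe_toc` and `titles_for_page` are hypothetical helpers for illustration, not pymupdf4llm functions.

```python
def safe_toc(toc):
    """Return a list that is always safe to iterate, even if toc is None."""
    return list(toc) if toc else []


def titles_for_page(toc, page_number):
    """Collect TOC titles that point at the given 1-based page number."""
    return [title for _level, title, page in safe_toc(toc) if page == page_number]


print(titles_for_page(None, 1))
print(titles_for_page([[1, "Section A", 1], [1, "Section B", 2]], 1))
```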

Minimal Diff Summary For Maintainers

Patch intent
Propagate ignore_alpha through layout wrappers into parse_document and use it to control ignore_invisible in layout line extraction.

Changes

Add ignore_alpha parameter to layout wrappers:
- _layout_to_markdown
- _layout_to_json
- _layout_to_text
Forward ignore_alpha into parse_document.
In parse_document:
- add ignore_alpha argument
- store document.ignore_alpha
In all relevant get_raw_lines calls in layout parse flow, use:
ignore_invisible = not (pagelayout.full_ocred or document.ignore_alpha),
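
To make the effect of that expression concrete, the sketch below tabulates it for all four input combinations. It is plain Python over two booleans: `full_ocred` and `ignore_alpha` stand in for pagelayout.full_ocred and document.ignore_alpha, and nothing here calls pymupdf4llm.

```python
from itertools import product


def ignore_invisible(full_ocred, ignore_alpha):
    """The expression quoted above, with plain booleans as inputs."""
    return not (full_ocred or ignore_alpha)


for full_ocred, ignore_alpha in product([False, True], repeat=2):
    result = ignore_invisible(full_ocred, ignore_alpha)
    print(f"full_ocred={full_ocred!s:5} ignore_alpha={ignore_alpha!s:5} "
          f"-> ignore_invisible={result}")
```

Under this expression, invisible spans are filtered only when both flags are False, which is consistent with the near-empty default output reported under "Observed".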

Validation outcome

Large recovery of extracted text in NK92733 and gupta1985 when ignore_alpha=True.
Confirms invisible-layer filtering in layout path is the main cause.

flymachine · May 6, 2026, 12:50pm

The latest version of the patch, which fixes the issues on these PDFs (including chiang1982.pdf), is as follows:

Patch Note: Unifying `ignore_alpha` Semantics Across Legacy/Layout and Restoring Layout Sensitivity

1. Scope and Objective

This patch package documents local fixes applied in a PyMuPDF4LLM environment to solve inconsistent handling of invisible text (alpha = 0) between:

- legacy extraction path
- layout extraction path

and to address a layout pipeline behavior that made ignore_alpha appear ineffective on some PDFs.

Goal:

- ignore_alpha = False: keep invisible text
- ignore_alpha = True: ignore invisible text

for both legacy and layout flows.

2. Environment

Observed in local site-packages under:

.venv/Lib/site-packages/pymupdf4llm/...
.venv/Lib/site-packages/pymupdf/layout/...

Representative package version used during validation: pymupdf4llm 1.27.2.3.

3. Root Cause Summary

Two independent issues were involved:

Semantic mismatch risk across code paths:

legacy path and layout path needed explicit alignment to the same ignore_alpha meaning.

Upstream RF filtering side effect in layout stack:

RF feature extraction removed invisible samples before downstream text extraction.
As a result, toggling ignore_alpha could become non-responsive in layout mode on affected pages/documents.

4. Patch Set

4.1 Layout wrappers: pass `ignore_alpha` through to parser

File:

.venv/Lib/site-packages/pymupdf4llm/__init__.py

Key points:

_layout_to_markdown(...) accepts ignore_alpha and forwards it to parse_document(...).
_layout_to_json(...) accepts ignore_alpha and forwards it to parse_document(...).
_layout_to_text(...) accepts ignore_alpha and forwards it to parse_document(...).

Representative lines:

ignore_alpha=False in layout wrapper signatures
ignore_alpha=ignore_alpha in parse_document(...) calls

4.2 Layout parser: apply visibility switch in raw-line extraction

File:

.venv/Lib/site-packages/pymupdf4llm/helpers/document_layout.py

Changes:

parse_document(...) includes parameter ignore_alpha=False
document.ignore_alpha = ignore_alpha
get_raw_lines(...) calls for picture-force-text / table-fallback / text-like boxes now use:


ignore_invisible=(not pagelayout.full_ocred) and document.ignore_alpha

This makes layout extraction obey the same operational semantics for invisible text handling.

4.3 Legacy path: normalize `accept_invisible` logic

File:

.venv/Lib/site-packages/pymupdf4llm/helpers/pymupdf_rag.py

Applied logic:


parms.accept_invisible = (
    page_is_ocr(page) or (not ignore_alpha)
)

Interpretation:

OCR pages keep invisible acceptance behavior as designed.
Non-OCR pages map directly to the unified meaning:
ignore_alpha=False → accept invisible
ignore_alpha=True → do not accept invisible
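
The mapping in 4.3 can be checked with a small table. This is a sketch over plain booleans: `is_ocr_page` stands in for the result of page_is_ocr(page), and no pymupdf4llm code is executed.

```python
def accept_invisible(is_ocr_page, ignore_alpha):
    """The 4.3 logic, with plain booleans as inputs."""
    return is_ocr_page or (not ignore_alpha)


# Build every combination so the OCR override is visible next to the
# non-OCR mapping.
table = {
    (is_ocr_page, ignore_alpha): accept_invisible(is_ocr_page, ignore_alpha)
    for is_ocr_page in (False, True)
    for ignore_alpha in (False, True)
}

for (is_ocr_page, ignore_alpha), accepted in table.items():
    print(f"ocr={is_ocr_page}, ignore_alpha={ignore_alpha}: accept={accepted}")
```

The only combination that rejects invisible text is a non-OCR page with ignore_alpha=True. OCR pages accept it regardless of the flag.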

4.4 RF stage: stop deleting invisible samples globally

File:

.venv/Lib/site-packages/pymupdf/layout/pymupdf_util_rf.py

Change intent:

Removed early deletion of invisible rows in RF extraction (after Line 23).
Keep invisible samples available so downstream policy (ignore_alpha) can decide.

Current function keeps a note:


# Keep invisible samples so downstream extraction can still honor
# caller-side visibility policy (e.g., ignore_alpha switches).

5. Unified Semantics (Final)

For both legacy and layout extraction:

ignore_alpha=False → include transparent/invisible text
ignore_alpha=True → filter transparent/invisible text

6. Why `to_json` Could Look More Complete Than Markdown

Even before full fix, to_json could appear to contain more text because data surfaces differ by stage:

JSON may include broader/fulltext-oriented information
Markdown/text rendering can depend on selected layout boxes/line extraction

If upstream layout candidates are reduced (for example by RF filtering), markdown can lose content even when some text exists elsewhere in intermediate structures.

7. Validation Checklist

Recommended validation corpus (used in this debugging thread):

base/Official Copy (Register) - NK92733-extract.pdf
base/gupta1985.pdf
base/chiang1982.pdf

Minimum checks:

legacy mode:

compare text length and visible content under ignore_alpha=False/True

layout mode:

compare text length and visible content under ignore_alpha=False/True

consistency:

verify direction is the same in both modes (False keeps more, True filters more)

regression sanity:

ensure normal visible text extraction quality is not degraded.
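
The "consistency" item can be turned into a small check. This is a sketch that assumes extracted text lengths have already been measured for each mode under both settings. `same_direction` is a hypothetical helper, and the numbers in the usage lines are made up purely to exercise it.

```python
def same_direction(legacy_lengths, layout_lengths):
    """Check that ignore_alpha moves text length the same way in both modes.

    Each argument is a (length_with_False, length_with_True) pair.
    Under the semantics in section 5, ignore_alpha=False keeps invisible
    text, so it should never yield less text than ignore_alpha=True.
    """
    legacy_ok = legacy_lengths[0] >= legacy_lengths[1]
    layout_ok = layout_lengths[0] >= layout_lengths[1]
    return legacy_ok and layout_ok


# Made-up lengths, for illustration only.
print(same_direction((1200, 300), (1180, 290)))  # both modes agree
print(same_direction((1200, 300), (290, 1180)))  # layout goes the other way
```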

8. Repro/Verification Script Tips

When reporting to another team, include:

exact package versions
exact patched files and line snippets
one-page targeted case (for example chiang1982 page 2)
full-document cross-check on multiple PDFs
output metric: extracted text length + spot-check key paragraphs

9. Operational Notes

These are local site-packages patches.
For durable sharing, convert this into an upstream PR or an internal fork patch set.
Keep patch minimal: avoid changing unrelated parsing behavior.

10. Quick Handoff Summary

Implemented local fixes that:

propagate ignore_alpha through layout entrypoints,
apply visibility filtering policy at layout raw-line extraction,
align legacy accept_invisible to the same semantic direction,
remove RF-stage hard deletion of invisible samples so layout mode can respond to ignore_alpha switches correctly.

Net effect: legacy/layout now follow one consistent user-facing meaning for ignore_alpha, and layout sensitivity is restored on previously problematic documents.

Jamie_Lemon · May 6, 2026, 1:03pm

@flymachine Please don’t spam here with so much AI generated analysis! Trying to read through your posts to understand it better right now.
