Spaces missing after extracting text with Page.get_text()

rdelaney · February 24, 2026, 4:16pm

Hello,

I am experiencing an issue where some spaces between words appear to be missing after extracting text using PyMuPDF’s Page.get_text() method.

Here is the PDF in question: https://export.arxiv.org/pdf/2601.05047v3

The code I am using to observe this behavior is as follows:

    from pathlib import Path

    import pymupdf

    input_pdf = Path("./documents/2601.05047v3.pdf")

    doc = pymupdf.open(input_pdf)
    page = doc.load_page(0)
    page_dict = page.get_text("dict")
    blocks = page_dict.get("blocks", [])
    # Keep only text blocks (type 0); image blocks have type 1.
    text_blocks = [block for block in blocks if block.get("type") == 0]

    for block in text_blocks:
        lines = block.get("lines", [])
        for line in lines:
            spans = line.get("spans", [])

            for span in spans:
                print(span.get("text"))

Below are some instances from the first page of this document where I am seeing missing spaces:

“for 10X memory capacity withHBM-like bandwidth; Processing-Near-Memory and”
“. Rather than a single densefeedforward block, MoE uses tens to hundreds”
“. Reasoning is a think-before-acttechnique to improve quality. An extra “thinking””
“. LLMs have evolved from text to image,audio, and video generation. Larger data types”
“A context window refers to the amountof information the LLM model can look at when”

I am using pymupdf v1.27.1.

Any assistance would be greatly appreciated!

Jamie_Lemon · February 24, 2026, 5:02pm

Interesting. I can replicate this if I import pymupdf.layout and pymupdf4llm, and only when I use pymupdf4llm.to_text. Can you confirm whether you are importing these?

rdelaney · February 24, 2026, 5:21pm

Hi @Jamie_Lemon, thank you for the quick response. I am not importing either of those. Nor are they installed in my environment. I’m only installing pymupdf.

Jamie_Lemon · February 24, 2026, 6:02pm

Very strange - I cannot replicate this when I only use import pymupdf. Please see my attached developer experience video. There is something odd about those spaces though - they look larger than regular spaces.

@HaraldLieder Please let us know your thoughts when you have time!

HaraldLieder · February 24, 2026, 9:20pm

This PDF awkwardly uses certain space-suppression techniques via so-called “ActualText” (to prevent certain things from happening when copy-pasting or extracting).
So if you use our flag bit that ignores ActualText, the spaces will reappear, like so:
text = page.get_text(flags=pymupdf.TEXT_IGNORE_ACTUALTEXT).

rdelaney · February 25, 2026, 12:30am

Wow, excellent! Confirming that using that flag bit does now include the expected spaces in the extracted text. As @Jamie_Lemon pointed out, they do appear oddly larger than a normal space, but this is definitely more satisfactory than omitting the spaces.

Thank you @Jamie_Lemon and @HaraldLieder for your assistance!

HaraldLieder · February 25, 2026, 10:13am

The reason for those “oddly larger” spaces is that the PDF author encoded them like that, for reasons only he or she can know. The native PDF text (i.e. including the “ActualText” parts) contains:

10X memory capacity with<U+200B><U+200B>HBM-like bandwidth

The Unicode character U+200B is called “Zero-Width Space”. It occupies no room and is normally used to indicate “if needed, insert a line break here”. Why we have two of them in a row remains the author’s secret. As per the specifications, this never makes sense.

If you use the recommended flag bit, ActualText is ignored and this character is replaced by a single space, which leads to the result you see.
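If the flag is not an option for some reason, a post-processing step is another possibility. The sketch below assumes the default extraction really does return the U+200B characters described above, and that every run of them stands in for a word gap; that second assumption may not hold for other PDFs, where a zero-width space can legitimately sit inside a word. The function name is hypothetical.

```python
import re

ZERO_WIDTH_SPACE = "\u200b"


def collapse_zero_width_spaces(text, replacement=" "):
    """Replace each run of U+200B characters with `replacement`."""
    return re.sub(f"{ZERO_WIDTH_SPACE}+", replacement, text)


sample = "memory capacity with\u200b\u200bHBM-like bandwidth"
# The two characters are present in the string, they are just invisible when printed.
print(len("with\u200b\u200bHBM"))           # 9
print(collapse_zero_width_spaces(sample))   # memory capacity with HBM-like bandwidth
```

Checking the string length (or printing repr(text)) is also a quick way to confirm whether the invisible characters are present in the extracted spans at all.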

rdelaney · February 25, 2026, 2:09pm

Thank you for the context, that is very helpful!
