Spaces missing after extracting text with Page.get_text()

Hello,

I am experiencing an issue where some spaces between words appear to be missing after extracting text using PyMuPDF’s Page.get_text() method.

Here is the PDF in question: https://export.arxiv.org/pdf/2601.05047v3

The code I am using to observe this behavior is as follows:

    from pathlib import Path

    import pymupdf

    input_pdf = Path("./documents/2601.05047v3.pdf")

    doc = pymupdf.open(input_pdf)
    page = doc.load_page(0)  # first page
    page_dict = page.get_text("dict")
    blocks = page_dict.get("blocks", [])
    # Keep text blocks only (type 0); image blocks have type 1.
    text_blocks = [block for block in blocks if block.get("type") == 0]

    for block in text_blocks:
        lines = block.get("lines", [])
        for line in lines:
            spans = line.get("spans", [])

            for span in spans:
                print(span.get("text"))

Below are some instances from the first page of this document where I am seeing missing spaces:

  • “​for 10X memory capacity with​​HBM-like bandwidth; Processing-Near-Memory and​”
  • “. Rather than a single dense​​feedforward block, MoE uses tens to hundreds​”
  • “. Reasoning is a think-before-act​​technique to improve quality. An extra “thinking”​"
  • “. LLMs have evolved from text to image,​​audio, and video generation. Larger data types​”
  • “​A context window refers to the amount​​of information the LLM model can look at when​”

I am using pymupdf v1.27.1.

Any assistance would be greatly appreciated!

Interesting. I can replicate this if I import pymupdf.layout and pymupdf4llm, and only when I use pymupdf4llm.to_text. Can you confirm whether you are importing these?

Hi @Jamie_Lemon, thank you for the quick response. I am not importing either of those. Nor are they installed in my environment. I’m only installing pymupdf.

Very strange - I cannot replicate this when I only use import pymupdf. Please see my attached developer experience video. There is something odd about those spaces though - they look larger than regular spaces.

@HaraldLieder Please let us know your thoughts when you have time!

This PDF awkwardly uses a space suppression technique via so-called “ActualText” (to prevent certain things from happening when copy-pasting or extracting).
So if you use our flag bit that ignores ActualText, the spaces will reappear, like so:

    text = page.get_text(flags=pymupdf.TEXT_IGNORE_ACTUALTEXT)

Wow, excellent! Confirming that using that flag bit does now include the expected spaces in the extracted text. As @Jamie_Lemon pointed out, they do appear oddly larger than a normal space, but this is definitely more satisfactory than omitting the spaces.

Thank you @Jamie_Lemon and @HaraldLieder for your assistance!

The reason for those “oddly larger” spaces is that the PDF author encoded them like that … for reasons only he / she can know. The native PDF text (i.e. including “ActualText” particles) contains

10X memory capacity with​​HBM-like bandwidth

The Unicode character U+200B is called “Zero‑Width Space”. It occupies no room and is normally used to indicate “if needed, insert a line break here”. Why we have two of them in a row remains the author’s secret. As per the specifications, this never makes sense.

If you use the recommended flag bit, ActualText is ignored and this character is replaced by one space … thus leading to the result.
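To check whether a given string is affected, the invisible characters can be located with the standard library alone. The sketch below is not a PyMuPDF feature: find_format_chars and collapse_zero_width_spaces are hypothetical helper names, and turning each run of zero-width spaces into one regular space is an assumption that fits this particular PDF, not a general rule.

```python
import re
import unicodedata

ZERO_WIDTH_SPACE = "\u200b"


def find_format_chars(text):
    """Return (index, code point, Unicode name) for each invisible format character."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cf"  # "Cf" = format characters, includes U+200B
    ]


def collapse_zero_width_spaces(text):
    """Replace each run of zero-width spaces with a single regular space.

    Assumption: in this document the zero-width spaces stand in for real spaces.
    """
    return re.sub(f"{ZERO_WIDTH_SPACE}+", " ", text)


sample = "10X memory capacity with\u200b\u200bHBM-like bandwidth"
for hit in find_format_chars(sample):
    print(hit)
print(collapse_zero_width_spaces(sample))
```

This kind of post-processing may help when the text has already been extracted without the flag, but the flag remains the more direct fix.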

Thank you for the context, that is very helpful!