Hello,
I am experiencing an issue where some spaces between words appear to be missing after extracting text using PyMuPDF’s Page.get_text() method.
Here is the PDF in question: https://export.arxiv.org/pdf/2601.05047v3
The code I am using to observe this behavior is as follows:
from pathlib import Path

import pymupdf

input_pdf = Path("./documents/2601.05047v3.pdf")
doc = pymupdf.open(input_pdf)
page = doc.load_page(0)
page_dict = page.get_text("dict")
blocks = page_dict.get("blocks", [])
# Keep text blocks only (type 0); image blocks have type 1.
text_blocks = [block for block in blocks if block.get("type") == 0]
for block in text_blocks:
    lines = block.get("lines", [])
    for line in lines:
        spans = line.get("spans", [])
        for span in spans:
            print(span.get("text"))
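Printing each span on its own line makes it hard to tell whether a space is missing inside a span or at a span boundary. A minimal sketch that joins the spans of each line first is below; `lines_from_page_dict` is a hypothetical helper name, not part of PyMuPDF, and it only assumes the "blocks" / "lines" / "spans" / "text" keys already used in the loop above.

```python
def lines_from_page_dict(page_dict):
    """Return one string per text line by concatenating its span texts."""
    result = []
    for block in page_dict.get("blocks", []):
        if block.get("type") != 0:  # skip non-text (image) blocks
            continue
        for line in block.get("lines", []):
            spans = line.get("spans", [])
            # Join without a separator so missing spaces stay visible.
            result.append("".join(span.get("text", "") for span in spans))
    return result


# Tiny hand-made dict in the same shape, used only for illustration.
example = {
    "blocks": [
        {"type": 0, "lines": [{"spans": [{"text": "capacity with"}, {"text": "HBM-like"}]}]},
        {"type": 1},  # an image block, ignored
    ]
}
for text in lines_from_page_dict(example):
    print(repr(text))  # repr() also exposes invisible characters
```

With a real page, the result of page.get_text("dict") would be passed in place of `example`.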
Below are some instances from the first page of this document where I am seeing missing spaces:
- “for 10X memory capacity withHBM-like bandwidth; Processing-Near-Memory and”
- “. Rather than a single densefeedforward block, MoE uses tens to hundreds”
- “. Reasoning is a think-before-acttechnique to improve quality. An extra “thinking””
- “. LLMs have evolved from text to image,audio, and video generation. Larger data types”
- “A context window refers to the amountof information the LLM model can look at when”
I am using pymupdf v1.27.1.
Any assistance would be greatly appreciated!
Interesting. I can replicate this if I import pymupdf.layout and pymupdf4llm, and only when I use pymupdf4llm.to_text. Can you confirm whether you are importing these?
Hi @Jamie_Lemon, thank you for the quick response. I am not importing either of those. Nor are they installed in my environment. I’m only installing pymupdf.
Very strange - I cannot replicate this when I only use import pymupdf. Please see my attached developer experience video. There is something odd about those spaces though - they look larger than regular spaces.
@HaraldLieder Please let us know your thoughts when you have time!
This PDF awkwardly uses certain space suppression techniques via so-called “ActualText” (to prevent certain things from happening when copy-pasting or extracting).
So if you use our flag bit that ignores ActualText, the spaces will re-appear, like so:
text = page.get_text(flags=pymupdf.TEXT_IGNORE_ACTUALTEXT)
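Since the original question uses "dict" extraction, here is a hedged sketch of the same idea for that mode. It assumes that a `flags` argument replaces the mode's default flags rather than adding to them, so the flag is OR-ed with `pymupdf.TEXTFLAGS_DICT` to keep the usual "dict" behavior. `extract_dict_ignoring_actualtext` is a hypothetical helper name.

```python
def extract_dict_ignoring_actualtext(pdf_path, page_number=0):
    """Return the "dict" extraction of one page with ActualText ignored."""
    # Imported inside the function so the helper can be defined
    # even in an environment where PyMuPDF is not installed.
    import pymupdf

    # Keep the default "dict" flags and add the ignore-ActualText bit.
    flags = pymupdf.TEXTFLAGS_DICT | pymupdf.TEXT_IGNORE_ACTUALTEXT
    with pymupdf.open(pdf_path) as doc:
        page = doc.load_page(page_number)
        return page.get_text("dict", flags=flags)
```

The returned dictionary has the same block / line / span layout as before, so the printing loop from the question can be reused unchanged.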
Wow, excellent! Confirming that using that flag bit does now include the expected spaces in the extracted text. As @Jamie_Lemon pointed out, they do appear oddly larger than a normal space, but this is definitely more satisfactory than omitting the spaces.
Thank you @Jamie_Lemon and @HaraldLieder for your assistance!
The reason for those “oddly larger” spaces is that the PDF author encoded them like that … for reasons only he / she can know. The native PDF text (i.e. including “ActualText” particles) contains
10X memory capacity with​​HBM-like bandwidth
The Unicode character U+200B is called “Zero‑Width Space”. It occupies no room and is normally used to indicate “if needed, insert a line break here”. Why we have two of them in a row remains the author’s secret. As per the specifications, this never makes sense.
If you use the recommended flag bit, ActualText is ignored and this character is replaced by a single space … thus leading to the result.
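For anyone who wants to check their own output for these characters, a small sketch follows. It assumes the extracted strings still carry the U+200B characters, which may not hold for every extraction mode; `count_zwsp` and `zwsp_runs_to_space` are hypothetical helper names, and the sample string is typed by hand rather than taken from PyMuPDF.

```python
import re

ZWSP = "\u200b"  # ZERO WIDTH SPACE


def count_zwsp(text):
    """Count zero-width spaces in a string."""
    return text.count(ZWSP)


def zwsp_runs_to_space(text):
    """Replace each run of one or more U+200B with one ordinary space."""
    return re.sub(f"{ZWSP}+", " ", text)


sample = "10X memory capacity with\u200b\u200bHBM-like bandwidth"
print(count_zwsp(sample))          # 2
print(zwsp_runs_to_space(sample))  # 10X memory capacity with HBM-like bandwidth
```

This is a post-processing alternative for cases where the flag cannot be used; it yields a normal-width space instead of the wider one.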
Thank you for the context, that is very helpful!