Pymupdf4llm: performance

robvd · November 18, 2025, 8:17am

The fix for issue #319 in pymupdf/pymupdf4llm ("ValueError: min() arg is an empty sequence") also mentioned fixing a performance issue. Although the performance for the file attached to that issue is now acceptable, I have many more PDFs to convert that are much slower. For example, processing the attached file takes many hours. Unfortunately, I don't control the source of these PDFs.

Is there any way to speed up processing of the attached file as well?

slow_performance.pdf (9.2 MB)

HaraldLieder · November 19, 2025, 8:16pm

@robvd Welcome to the Forum!

Found the problem! This PDF uses PDF structure information in a crazy manner, which slows down operations that can normally be done in a breeze.
Since we currently ignore this category of information entirely, we might as well get the whole monster out of the way.
Doing this speeds up processing of this file enormously: markdown creation for the 331 pages took me less than 6 minutes (about 1 second per page).
I am checking out ways to include this in the next version, 0.2.3.
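Until a release containing the change is available, one possible interim workaround is to unlink the structure tree from the PDF before conversion. The sketch below is an assumption, not the change described above: it guesses that removing the catalog's /StructTreeRoot entry is enough, which this thread does not confirm, and the helper name `drop_structure_tree` is hypothetical. It uses PyMuPDF's low-level `pdf_catalog()`, `xref_get_key()` and `xref_set_key()` methods.

```python
import sys


def drop_structure_tree(doc):
    """Unlink the structure tree from the PDF catalog, if present.

    `doc` is expected to be an open pymupdf.Document. Returns True if a
    /StructTreeRoot entry was found and removed, False otherwise.
    Assumption: this is only a guess at a workaround, not the library's fix.
    """
    catalog = doc.pdf_catalog()  # xref number of the PDF catalog object
    kind, _value = doc.xref_get_key(catalog, "StructTreeRoot")
    if kind == "null":  # xref_get_key reports ("null", "null") for a missing key
        return False
    # Setting a key to "null" removes it from the object dictionary.
    doc.xref_set_key(catalog, "StructTreeRoot", "null")
    return True


def main(src, dst):
    import pymupdf  # imported lazily so the helper has no hard dependency

    doc = pymupdf.open(src)
    if drop_structure_tree(doc):
        # garbage=3 drops the objects that are no longer referenced.
        doc.save(dst, garbage=3, deflate=True)
        print(f"Structure tree removed, wrote {dst}")
    else:
        print("No structure tree found, nothing written")
    doc.close()


if __name__ == "__main__" and len(sys.argv) == 3:
    main(sys.argv[1], sys.argv[2])
```

The cleaned copy can then be passed to `pymupdf4llm.to_markdown()` as usual. Whether this gives a comparable speedup would have to be measured on the file in question.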

HaraldLieder · November 19, 2025, 10:57pm

Here is the console output of a markdown conversion:

python test-4llm.py slow_performance.pdf
Parsing 331 pages of 'slow_performance.pdf'...
100%|████████████████████████████████████████| 331/331 [05:14<00:00,  1.05it/s]
Info messages during parsing:
Performing full-page OCR on page.number=0/1...

Generating markdown text...
100%|███████████████████████████████████████| 331/331 [00:00<00:00, 725.11it/s]
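As a quick cross-check of the "about 1 second per page" figure, the parsing rate can be recomputed from the page count and the elapsed time shown in the first progress bar. This is only a small arithmetic sketch; the helper name `elapsed_seconds` is made up for illustration.

```python
def elapsed_seconds(clock):
    """Convert a progress-bar time string ('MM:SS' or 'HH:MM:SS') to seconds."""
    seconds = 0
    for part in clock.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


pages = 331                            # page count reported in the log
parse_time = elapsed_seconds("05:14")  # elapsed parsing time from the log

print(f"{pages / parse_time:.2f} pages/s")  # 1.05 pages/s
print(f"{parse_time / pages:.2f} s/page")   # 0.95 s/page
```

Parsing dominates the run; the markdown generation step that follows finishes in under a second.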

robvd · November 20, 2025, 10:23am

Hi Harald,

That sounds great! Looking forward to trying out the new version.
