pymupdf4llm: performance

The fix for issue #319 in pymupdf/pymupdf4llm on GitHub ("ValueError: min() arg is an empty sequence") also mentioned fixing a performance issue. Although the performance for the file attached to that issue is now acceptable, I have many more PDFs to convert that are much slower. For example, the attached file takes many hours to process. Unfortunately, I don't control the source of these PDFs.

Is there any way to speed up processing of the attached file as well?

slow_performance.pdf (9.2 MB)

@robvd Welcome to the Forum!

Found the problem! This PDF uses PDF structure information in a crazy manner, which slows down operations that are normally done in a breeze.
Since we currently ignore this category of information entirely, we might as well get the whole monster out of the way.
Doing this speeds up processing of this file enormously: markdown creation for the 331 pages took me less than 6 minutes (about 1 second per page).
I am checking out ways to include this in the next version, 0.2.3.

Here is the console output of a markdown conversion:

python test-4llm.py slow_performance.pdf
Parsing 331 pages of 'slow_performance.pdf'...
100%|████████████████████████████████████████| 331/331 [05:14<00:00,  1.05it/s]
Info messages during parsing:
Performing full-page OCR on page.number=0/1...

Generating markdown text...
100%|███████████████████████████████████████| 331/331 [00:00<00:00, 725.11it/s]
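
Until that version is available, the idea of getting the structure information out of the way could perhaps be approximated in user code. The following is only a minimal sketch under assumptions: it assumes the structure information in question is the tree referenced by the /StructTreeRoot key of the PDF catalog, and that deleting that key before conversion is enough. It is not confirmed to be what the library fix does, and it has not been verified to reproduce the speedup above. The helper names drop_structure_tree and convert are hypothetical.

```python
def drop_structure_tree(doc):
    """Delete /StructTreeRoot from the PDF catalog of an open PyMuPDF document.

    Returns True if the key was present and has been removed, False otherwise.
    Assumption: this key is the relevant "structure information".
    """
    catalog = doc.pdf_catalog()  # xref number of the PDF catalog object
    kind, _value = doc.xref_get_key(catalog, "StructTreeRoot")
    if kind == "null":  # key does not exist, nothing to do
        return False
    # Setting a dictionary key to "null" removes it from the object.
    doc.xref_set_key(catalog, "StructTreeRoot", "null")
    return True


def convert(path):
    """Hypothetical wrapper: open the PDF, drop the structure tree, convert."""
    # Imported lazily so the helper above can be used and tested on its own.
    import pymupdf
    import pymupdf4llm

    doc = pymupdf.open(path)
    drop_structure_tree(doc)  # modifies the in-memory document only
    return pymupdf4llm.to_markdown(doc)
```

Calling convert("slow_performance.pdf") would return the markdown text. The file on disk is left untouched, because the document is never saved.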

Hi Harald,

That sounds great! Looking forward to trying out the new version.