The fix for "ValueError: min() arg is an empty sequence" (Issue #319 in pymupdf/pymupdf4llm) also mentioned fixing a performance issue. Although performance for the file attached to that issue is now acceptable, I have many more PDFs to convert that are much slower. For example, processing the attached file takes many hours. Unfortunately, I don't control the source of these PDFs.
Is there any way to speed up processing of the attached file as well?
Found the problem! This PDF uses PDF structure information in a crazy manner, slowing down things that normally can be done in a breeze.
Since we currently ignore this category of information altogether, we might as well get the whole monster out of the way.
Doing this speeds up processing this file enormously: markdown creation of the 331 pages took me less than 6 minutes (about 1 second per page).
I am looking into ways to include this in the next version, 0.2.3.
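
Until such a release is available, something along these lines might serve as a workaround. This is only a rough sketch and not necessarily the change that will ship: it assumes the structure information in question is the tree referenced by the `/StructTreeRoot` entry of the PDF catalog, and that dropping that entry in memory is enough to take it out of the way. The helper names `drop_structure_tree` and `convert_without_structure` are made up for illustration.

```python
def drop_structure_tree(doc):
    """Remove the /StructTreeRoot entry from the PDF catalog, if present.

    `doc` is expected to behave like a pymupdf.Document.
    Returns True if an entry was removed, False if there was none.
    ASSUMPTION: this entry is the structure information that slows things down.
    """
    catalog = doc.pdf_catalog()  # xref number of the PDF catalog object
    kind, _ = doc.xref_get_key(catalog, "StructTreeRoot")
    if kind == "null":  # the key is absent, nothing to do
        return False
    # Setting a dictionary key to "null" removes it from the object.
    doc.xref_set_key(catalog, "StructTreeRoot", "null")
    return True


def convert_without_structure(path):
    """Open a PDF, drop its structure tree in memory, return the Markdown text."""
    # Imported here so the helper above can be used without these packages.
    import pymupdf
    import pymupdf4llm

    doc = pymupdf.open(path)
    drop_structure_tree(doc)  # in-memory change only; the file is not rewritten
    return pymupdf4llm.to_markdown(doc)
```

Calling `convert_without_structure("input.pdf")` leaves the file on disk untouched, since the document is never saved. Whether this gives the same speedup as reported above is untested here.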