Pymupdf4llm parsing takes an excessively long time

Hi,

I use pymupdf4llm to parse the content of some PDFs.
In most cases it works great. However, I found a few documents that take an exceedingly long time to parse, and I cannot find a reason for it, as they contain mostly text.

Attached is an example of such a document. Each of the first 3 pages takes ~7 minutes to parse.

output1.pdf (661.3 KB)

Here is the code I used:

import io
from pathlib import Path
from time import perf_counter

import pymupdf
import pymupdf4llm

def main() -> None:
    pdf_file: Path = Path("")  # Put your local path here

    pdf_content = pdf_file.read_bytes()
    buf = io.BytesIO(pdf_content)
    with pymupdf.open(stream=buf) as document_pymu:
        n_pages = document_pymu.page_count

        for page_nr in range(n_pages):
            t0 = perf_counter()
            _ = pymupdf4llm.to_markdown(
                doc=document_pymu,
                pages=[page_nr],
                table_strategy="lines",
            )
            page_extraction_time = perf_counter() - t0

            print(f"Content of page {page_nr} extracted in {page_extraction_time} seconds "
                  f"({page_extraction_time/60:.2f} minutes)")

if __name__ == "__main__":
    main()

Library versions:

  • pymupdf: 1.26.6
  • pymupdf4llm: 0.1.9

Would anyone have insight into why this specific document takes so much time when most PDFs are parsed quickly, and how to improve this?

Hi @JasmineGMT
I would do three things to speed up your processing and to get better results:

1) I would crop and scale your PDF. Right now it has large dimensions and a lot of white space, which just means more processing time to cover all that space. Please see the attached Python file, which crops your original PDF down and then scales it to a more manageable A4 size.

crop-and-scale.py (1.4 KB)

2) I would install and run the latest PyMuPDF4LLM - pip install pymupdf4llm==0.2.6

3) I would install and use PyMuPDF Layout for faster, better results.

install: pip install pymupdf-layout

use: Just import pymupdf.layout in place of pymupdf, i.e.:

import pymupdf.layout
import pymupdf4llm

def main() -> None:
    pdf_file: Path = Path("")

…

For more on PyMuPDF Layout see https://pymupdf.io and the PyMuPDF Layout page in the PyMuPDF documentation.
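The attached crop-and-scale.py is not reproduced in this thread, so here is only a rough sketch of what point 1 could look like. It is not the attached script. The names `union_box`, `fit_scale` and `crop_and_scale` are made up for illustration, the content area is estimated from text and image blocks only (vector drawings are ignored), and the target is assumed to be A4 portrait at 595 x 842 points.

```python
from typing import Iterable, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in PDF points

A4_WIDTH, A4_HEIGHT = 595.0, 842.0  # assumed target: A4 portrait in points


def union_box(boxes: Iterable[Box]) -> Box:
    """Return the smallest box that encloses all given boxes."""
    boxes = list(boxes)
    if not boxes:
        raise ValueError("no content boxes")
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


def fit_scale(box: Box, width: float = A4_WIDTH, height: float = A4_HEIGHT) -> float:
    """Return the uniform scale factor that fits `box` into a width x height page."""
    box_w, box_h = box[2] - box[0], box[3] - box[1]
    return min(width / box_w, height / box_h)


def crop_and_scale(src_path: str, dst_path: str) -> None:
    """Copy the content area of every page onto a fresh A4 page."""
    import pymupdf  # imported here so the helpers above work without PyMuPDF

    with pymupdf.open(src_path) as src, pymupdf.open() as dst:
        for page in src:
            # Each block is a tuple whose first four items are x0, y0, x1, y1.
            boxes = [tuple(b[:4]) for b in page.get_text("blocks")]
            # Fall back to the full page when no blocks are found.
            content = pymupdf.Rect(union_box(boxes)) if boxes else page.rect
            scale = fit_scale(tuple(content))
            target = pymupdf.Rect(0, 0, content.width * scale, content.height * scale)
            new_page = dst.new_page(width=A4_WIDTH, height=A4_HEIGHT)
            # Draw only the clipped content area of the source page, scaled to fit.
            new_page.show_pdf_page(target, src, page.number, clip=content)
        dst.save(dst_path)
```

Calling `crop_and_scale("output1.pdf", "output1-a4.pdf")` would write a rescaled copy that can then be passed to `pymupdf4llm.to_markdown`. How much this helps will depend on the document.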

Let me know how it goes please!

Thanks @Jamie_Lemon

I tried updating pymupdf4llm, and the processing time fell to ~4 minutes per page. Better, but still too long.

Then I switched to PyMuPDF Layout with use_ocr=False, and processing took approximately a minute for the whole document, which is quite acceptable to me 🙂

Cropping and scaling are nice, but I would like to be able to parse any PDF in a reasonable time, even those with a non-optimal layout such as this one.
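For anyone landing here later, the setup that worked can be sketched roughly as follows. This is a minimal sketch and not the exact script that was run: `timed` and `layout_markdown` are hypothetical helper names, and it assumes that pymupdf-layout is installed, that `pymupdf.layout` has to be imported before `pymupdf4llm`, and that `to_markdown` accepts the `use_ocr` option mentioned above.

```python
from pathlib import Path
from time import perf_counter
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def timed(fn: Callable[..., T], *args, **kwargs) -> Tuple[T, float]:
    """Call fn and return its result together with the elapsed seconds."""
    t0 = perf_counter()
    result = fn(*args, **kwargs)
    return result, perf_counter() - t0


def layout_markdown(pdf_path: Path) -> str:
    """Convert a whole PDF to Markdown with PyMuPDF Layout and OCR disabled."""
    # Assumption: importing pymupdf.layout first is what switches
    # pymupdf4llm over to the layout-based pipeline.
    import pymupdf.layout  # noqa: F401
    import pymupdf4llm

    # use_ocr=False is the setting reported in this thread.
    return pymupdf4llm.to_markdown(str(pdf_path), use_ocr=False)
```

Usage would be something like `markdown, seconds = timed(layout_markdown, Path("output1.pdf"))`, which makes it easy to compare timings before and after the change.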
