Hi,
I use pymupdf4llm to parse the content of some PDFs.
In most cases it works great; however, I found a few documents that take an exceedingly long time to parse, and I cannot find a reason for it, as they contain mostly text.
Attached is an example of such a document: the first 3 pages take ~7 minutes each to parse.
output1.pdf (661.3 KB)
Here is the code I used:
import io
from pathlib import Path
from time import perf_counter

import pymupdf
import pymupdf4llm


def main() -> None:
    pdf_file: Path = Path("")  # Put your local path here
    pdf_content = pdf_file.read_bytes()
    buf = io.BytesIO(pdf_content)
    with pymupdf.open(stream=buf) as document_pymu:
        n_pages = document_pymu.page_count
        for page_nr in range(n_pages):
            t0 = perf_counter()
            _ = pymupdf4llm.to_markdown(
                doc=document_pymu,
                pages=[page_nr],
                table_strategy="lines",
            )
            page_extraction_time = perf_counter() - t0
            print(f"Content of page {page_nr} extracted in {page_extraction_time} seconds "
                  f"({page_extraction_time / 60:.2f} minutes)")


if __name__ == "__main__":
    main()
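In case it helps narrow things down, here is a small diagnostic sketch I could run. My assumption (not confirmed) is that pages with a very large number of vector drawings make the table/graphics detection inside to_markdown slow; page.get_drawings() is PyMuPDF's call for extracting a page's vector graphics:

import io
from pathlib import Path

import pymupdf

# Count vector drawings per page. Assumption: pages with thousands of
# drawings are the ones that take minutes in to_markdown.
pdf_file = Path("")  # Put your local path here
with pymupdf.open(stream=io.BytesIO(pdf_file.read_bytes())) as doc:
    for page in doc:
        drawings = page.get_drawings()  # vector graphics on this page
        print(f"Page {page.number}: {len(drawings)} drawings")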
Library versions:
- pymupdf: 1.26.6
- pymupdf4llm: 0.1.9
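If it helps, profiling a single slow page with the standard-library profiler should show which internal calls dominate; a minimal sketch (assuming the attached output1.pdf is saved locally):

import cProfile
import pstats

import pymupdf
import pymupdf4llm

# Profile the first (slow) page only, then print the 20 most expensive
# calls by cumulative time.
doc = pymupdf.open("output1.pdf")  # the attached file, saved locally
profiler = cProfile.Profile()
profiler.enable()
pymupdf4llm.to_markdown(doc=doc, pages=[0], table_strategy="lines")
profiler.disable()
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)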
Would anyone have insight into why this specific document takes so long to parse while most PDFs are handled quickly, and how I could improve this?