pymupdf4llm.to_markdown memory leak

I’m seeing a non-managed memory leak in the pymupdf4llm.to_markdown method when using layout mode instead of legacy mode. I’ve tried both batching via page_chunks=True and iterating over each page with pages=[pno]. I have confirmed that the leak is somewhere in pymupdf4llm.to_markdown by running the same code but dispatching only on PyMuPDF methods like get_text, where memory reaches a steady state (see the sketch below). Similarly, if I explicitly switch back to legacy mode via pymupdf4llm.use_layout(False), memory stays at a steady state. Note that the latest versions automatically default to layout mode once the ocr & layout extras are installed and the pymupdf.layout module is imported.
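For reference, here is roughly the PyMuPDF-only baseline loop I used to rule out PyMuPDF itself; the path is a placeholder and the shape mirrors the reproduction script below:

import gc

import fitz
import psutil

with open("<path-to-your-pdf>", "rb") as file:
    pdf_bytes = file.read()

start_rss = psutil.Process().memory_info().rss
for i in range(1000):
    with fitz.open(stream=pdf_bytes, filetype="pdf") as doc:
        for page in doc:
            page.get_text()  # plain PyMuPDF extraction, no pymupdf4llm
    gc.collect()
    fitz.TOOLS.store_shrink(100)  # shrink MuPDF's internal cache
    rss = psutil.Process().memory_info().rss
    print(f"iteration {i}: rss {rss}b +{rss - start_rss}b")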

Environment

  • python == 3.11.13
  • PyMuPDF == 1.27.2
  • pymupdf4llm[ocr,layout] == 1.27.2.1
  • OS == macOS
  • arch == Apple Silicon M4 (arm64)
  • tesseract via brew == 5.5.2

Minimal Reproducible Script

# pip install "PyMuPDF==1.27.2", "pymupdf4llm[ocr,layout]==1.27.2.1", "psutil==7.2.2"
import fitz
import pymupdf.layout  # type:ignore # noqa:F401
import pymupdf4llm
import tempfile
import gc
import psutil

with open(
    "<path-to-your-pdf>",
    "rb",
) as file:
    pdf_bytes = file.read()

start_rss = psutil.Process().memory_info().rss
print(f"start rss: {start_rss}b")
with tempfile.TemporaryDirectory() as tmp_dir:
    for i in range(1000):
        try:
            print(f"starting iteration {i}")
            with fitz.open(stream=pdf_bytes, filetype="pdf") as doc:
                for page_num, page in enumerate(doc):
                    pymupdf4llm.to_markdown(
                        doc,
                        write_images=True,
                        image_path=tmp_dir,
                        force_text=False,
                        header=False,
                        footer=False,
                        pages=[page_num],
                    )
            # explicit cleanup: doesn't free up the leaked memory
            gc.collect()
            fitz.TOOLS.store_shrink(100)
            rss = psutil.Process().memory_info().rss
            print(f"mem rss: {rss}b +{rss - start_rss}b")
        except Exception as exc:
            print(f"exception: {exc}")

I’m seeing memory ballooning with a variety of large and small pdf documents sent through code similar to the above script. I ended up switching back to legacy mode because we’re running in a container environment with limited memory capacity (<= 12GB).

Welcome here @Zambonilli - please let us have an example file, or are you saying this happens independently of the specific file?

Other comments:

  • Please stop using import fitz - this package name has been deprecated for a long time now
  • Could you try setting use_ocr=False? This makes sure no OCR is ever attempted - something that happens in layout mode only anyway. See the example below.
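In the reproduction script above, only the to_markdown call needs to change - something like this, with the other arguments unchanged:

pymupdf4llm.to_markdown(
    doc,
    write_images=True,
    image_path=tmp_dir,
    force_text=False,
    header=False,
    footer=False,
    pages=[page_num],
    use_ocr=False,  # never attempt OCR
)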

Thanks @HaraldLieder,

I’m seeing the memory growth with any pdf. Large docs ramp faster, but I’m also seeing it when processing a lot of one-page docs.

I just tried the example script above with use_ocr=False and did see memory being reclaimed, which I had not seen before in my real code without use_ocr=False. I’ll try my best to switch back from legacy mode, reprocess the sample corpus of ~600 pdfs where I was seeing this memory leak, and see if this fixes it there too.

I’d love for this to work because we saw really amazing results with the ONNX neural network, and being able to run on CPU removes so many barriers compared to heavier-weight solutions like docling or transformer models.

Thanks for the information and the nice feedback! Glad to read that our effort to create a solution with a smaller footprint and faster execution, while remaining fully competitive in terms of result quality, is appreciated!

That said, onnxruntime is still a large package. I hope that it will become smaller - at least indirectly, by removing its current dependencies on sympy and mpmath. Both could be removed / replaced by using numpy features with fairly low effort …

It would be valuable information to learn whether disabling OCR has an effect on memory consumption.

I was able to set use_ocr=False, add the explicit gc and store-shrink calls, and rerun on my ~600 pdf corpus. While I’m seeing some memory being reclaimed, it is still growing incrementally, whereas the legacy version does not grow incrementally.
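For reference, the corpus rerun does roughly the following per document (the corpus location is simplified to a local placeholder directory here):

import gc
import pathlib
import tempfile

import psutil
import pymupdf
import pymupdf.layout  # noqa: F401  # keeps layout mode active
import pymupdf4llm

start_rss = psutil.Process().memory_info().rss
with tempfile.TemporaryDirectory() as tmp_dir:
    for pdf_path in sorted(pathlib.Path("<corpus-dir>").glob("*.pdf")):
        with pymupdf.open(pdf_path) as doc:
            pymupdf4llm.to_markdown(
                doc,
                write_images=True,
                image_path=tmp_dir,
                use_ocr=False,  # OCR disabled as suggested above
            )
        gc.collect()
        pymupdf.TOOLS.store_shrink(100)  # drop cached MuPDF resources
        rss = psutil.Process().memory_info().rss
        print(f"{pdf_path.name}: rss {rss}b +{rss - start_rss}b")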