pymupdf4llm.to_markdown memory leak

I’m seeing a non-managed memory leak in the pymupdf4llm.to_markdown method when using layout mode instead of legacy mode. I’ve tried both batching via page_chunks=True and iterating over each page with pages=[pno]. I have confirmed that the leak is somewhere in pymupdf4llm.to_markdown by running the same code but dispatching only on PyMuPDF methods like get_text, where memory reaches a steady state (see the sketch below). Similarly, if I explicitly switch back to legacy mode via pymupdf4llm.use_layout(False), memory stays at a steady state. Note that the latest versions automatically default to layout mode once the ocr & layout extras are installed and the pymupdf.layout module is imported.
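For reference, here is roughly the PyMuPDF-only baseline loop I used to rule out PyMuPDF itself; the path is a placeholder and the shape mirrors the reproduction script below:

import gc

import fitz
import psutil

with open("<path-to-your-pdf>", "rb") as file:
    pdf_bytes = file.read()

start_rss = psutil.Process().memory_info().rss
for i in range(1000):
    with fitz.open(stream=pdf_bytes, filetype="pdf") as doc:
        for page in doc:
            page.get_text()  # plain PyMuPDF extraction, no pymupdf4llm
    gc.collect()
    fitz.TOOLS.store_shrink(100)  # shrink MuPDF's internal cache
    rss = psutil.Process().memory_info().rss
    print(f"iteration {i}: rss {rss}b +{rss - start_rss}b")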

Environment

  • python == 3.11.13
  • PyMuPDF == 1.27.2
  • pymupdf4llm[ocr,layout] == 1.27.2.1
  • OS == macOS
  • arch == Apple Silicon M4 (arm64)
  • tesseract via brew == 5.5.2

Minimal Reproducible Script

# pip install "PyMuPDF==1.27.2", "pymupdf4llm[ocr,layout]==1.27.2.1", "psutil==7.2.2"
import fitz
import pymupdf.layout  # type:ignore # noqa:F401
import pymupdf4llm
import tempfile
import gc
import psutil

with open(
    "<path-to-your-pdf>",
    "rb",
) as file:
    pdf_bytes = file.read()

start_rss = psutil.Process().memory_info().rss
print(f"start rss: {start_rss}b")
with tempfile.TemporaryDirectory() as tmp_dir:
    for i in range(1000):
        try:
            print(f"starting iteration {i}")
            with fitz.open(stream=pdf_bytes, filetype="pdf") as doc:
                for page_num, page in enumerate(doc):
                    pymupdf4llm.to_markdown(
                        doc,
                        write_images=True,
                        image_path=tmp_dir,
                        force_text=False,
                        header=False,
                        footer=False,
                        pages=[page_num],
                    )
            # explicit cleanup: doesn't free up the leaked memory
            gc.collect()
            fitz.TOOLS.store_shrink(100)
            rss = psutil.Process().memory_info().rss
            print(f"mem rss: {rss}b +{rss - start_rss}b")
        except Exception as exc:
            print(f"exception: {exc}")

I’m seeing memory ballooning with a variety of large and small pdf documents sent through code similar to the above script. I ended up switching back to legacy mode because we’re running in a container environment with limited memory capacity (<= 12GB).

Welcome here @Zambonilli - please let us have an example file, or are you saying this happens independently of the specific file?

Other comments:

  • Please stop using import fitz - this package name has been deprecated for a long time now
  • Could you try setting use_ocr=False? This makes sure no OCR is ever attempted - something that happens in layout mode only anyway. See the example below.
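In the reproduction script above, only the to_markdown call needs to change - something like this, with the other arguments unchanged:

pymupdf4llm.to_markdown(
    doc,
    write_images=True,
    image_path=tmp_dir,
    force_text=False,
    header=False,
    footer=False,
    pages=[page_num],
    use_ocr=False,  # never attempt OCR
)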

Thanks @HaraldLieder,

I’m seeing the memory growth with any pdf. Large docs ramp faster, but I’m also seeing it when processing a lot of one-page docs.

I just tried the example script above with use_ocr=False and did see memory being reclaimed, which I had not seen before in my real code without use_ocr=False. I’ll try my best to switch back from legacy mode, reprocess the sample corpus of ~600 pdfs where I was seeing this memory leak, and see if this fixes it there too.

I’d love for this to work because we saw really amazing results with the ONNX neural network, and being able to run on CPU removes so many barriers compared to heavier-weight solutions like docling or transformer models.

Thanks for the information and the nice feedback! Glad to read that our effort to create a solution with a smaller footprint and faster execution, while remaining fully competitive in terms of result quality, is appreciated!

That said, onnxruntime is still a large package. I hope that it will become smaller - at least indirectly, by removing its current dependencies on sympy and mpmath. Both could be removed / replaced by using numpy features with fairly low effort …

It would be valuable information to learn whether disabling OCR has an effect on memory consumption.

I was able to set use_ocr=False, add the explicit gc and store-shrink calls, and rerun on my ~600 pdf corpus. While I’m seeing some memory being reclaimed, it is still growing incrementally, whereas the legacy version does not grow incrementally.
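For reference, the corpus rerun does roughly the following per document (the corpus location is simplified to a local placeholder directory here):

import gc
import pathlib
import tempfile

import psutil
import pymupdf
import pymupdf.layout  # noqa: F401  # keeps layout mode active
import pymupdf4llm

start_rss = psutil.Process().memory_info().rss
with tempfile.TemporaryDirectory() as tmp_dir:
    for pdf_path in sorted(pathlib.Path("<corpus-dir>").glob("*.pdf")):
        with pymupdf.open(pdf_path) as doc:
            pymupdf4llm.to_markdown(
                doc,
                write_images=True,
                image_path=tmp_dir,
                use_ocr=False,  # OCR disabled as suggested above
            )
        gc.collect()
        pymupdf.TOOLS.store_shrink(100)  # drop cached MuPDF resources
        rss = psutil.Process().memory_info().rss
        print(f"{pdf_path.name}: rss {rss}b +{rss - start_rss}b")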