I’m seeing a non-managed memory leak in pymupdf4llm.to_markdown when using layout mode instead of legacy mode. I’ve tried both batching via page_chunks=True and iterating over each page with pages=[pno]; memory grows either way. I have confirmed that the leak is somewhere in pymupdf4llm.to_markdown: the same code, dispatching only to pymupdf methods such as get_text, keeps memory at a steady state. Similarly, if I explicitly switch back to legacy mode via pymupdf4llm.use_layout(False), memory stays at a steady state. Note that the latest versions default to layout mode automatically if you install the ocr and layout extras and import the pymupdf.layout module.
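A minimal sketch of the comparison described above, for anyone who wants to repeat it. The helper names (rss_deltas, psutil_rss, control_run, layout_run) are hypothetical and not part of PyMuPDF or pymupdf4llm; the sketch assumes psutil is installed and that the PDF is already loaded into pdf_bytes.

```python
import gc
from typing import Callable, List


def rss_deltas(work: Callable[[], None], iterations: int,
               read_rss: Callable[[], int]) -> List[int]:
    """Call work() repeatedly and return RSS growth (bytes) relative to the start."""
    start = read_rss()
    deltas = []
    for _ in range(iterations):
        work()
        gc.collect()  # rule out garbage that Python itself could still reclaim
        deltas.append(read_rss() - start)
    return deltas


def psutil_rss() -> int:
    """Current resident set size of this process, read via psutil."""
    import psutil  # third-party; imported lazily so rss_deltas stays stdlib-only
    return psutil.Process().memory_info().rss


def control_run(pdf_bytes: bytes) -> None:
    """Control case: plain PyMuPDF text extraction, no pymupdf4llm involved."""
    import pymupdf
    with pymupdf.open(stream=pdf_bytes, filetype="pdf") as doc:
        for page in doc:
            page.get_text()


def layout_run(pdf_bytes: bytes) -> None:
    """Suspect case: pymupdf4llm.to_markdown in layout mode, one page at a time."""
    import pymupdf
    import pymupdf.layout  # noqa: F401  (importing this enables layout mode)
    import pymupdf4llm
    with pymupdf.open(stream=pdf_bytes, filetype="pdf") as doc:
        for pno in range(doc.page_count):
            pymupdf4llm.to_markdown(doc, pages=[pno])
```

Calling rss_deltas(lambda: control_run(pdf_bytes), 50, psutil_rss) and then the same with layout_run gives two series to compare: a flat series suggests steady state, while a series that keeps climbing suggests a leak in that code path.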
Environment
python==3.11.13
PyMuPDF==1.27.2
pymupdf4llm[ocr,layout]==1.27.2.1
OS == macOS
arch == Apple silicon M4 (arm64)
tesseract via brew == 5.5.2
Minimum Reproducible Script
# pip install "PyMuPDF==1.27.2" "pymupdf4llm[ocr,layout]==1.27.2.1" "psutil==7.2.2"
import fitz
# Importing pymupdf.layout switches pymupdf4llm to layout mode.
import pymupdf.layout  # type:ignore # noqa:F401
import pymupdf4llm
import tempfile
import gc
import psutil

with open("<path-to-your-pdf>", "rb") as file:
    pdf_bytes = file.read()

# Baseline RSS before any conversion work.
start_rss = psutil.Process().memory_info().rss
print(f"start rss: {start_rss}b")

with tempfile.TemporaryDirectory() as tmp_dir:
    for i in range(1000):
        try:
            print(f"starting iteration {i}")
            # Re-open the document on every iteration so nothing is reused.
            with fitz.open(stream=pdf_bytes, filetype="pdf") as doc:
                for page_num, page in enumerate(doc):
                    # Convert one page at a time.
                    pymupdf4llm.to_markdown(
                        doc,
                        write_images=True,
                        image_path=tmp_dir,
                        force_text=False,
                        header=False,
                        footer=False,
                        pages=[page_num],
                    )
            # Neither of these frees up the memory.
            gc.collect()
            fitz.TOOLS.store_shrink(100)
            rss = psutil.Process().memory_info().rss
            print(f"mem rss: {rss}b +{rss - start_rss}b")
        except Exception as exc:
            print(f"exception: {exc}")
I’m seeing memory balloon on a variety of large and small PDF documents sent through code similar to the script above. I ended up switching back to legacy mode because we run in a container environment with limited memory (<= 12 GB).
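To put a number on the growth, a rough sketch like the following can turn the per-iteration RSS readings into an average growth rate and an estimate of how many more iterations fit under a memory limit. The function names and the simple first-to-last average are assumptions made for illustration, not something PyMuPDF provides, and the estimate is only meaningful if the growth is roughly linear.

```python
from typing import Optional, Sequence


def growth_per_iteration(rss_samples: Sequence[int]) -> float:
    """Average RSS growth in bytes per iteration, from first to last sample."""
    if len(rss_samples) < 2:
        raise ValueError("need at least two RSS samples")
    return (rss_samples[-1] - rss_samples[0]) / (len(rss_samples) - 1)


def iterations_until_limit(rss_samples: Sequence[int],
                           limit_bytes: int) -> Optional[int]:
    """Estimated iterations left before RSS reaches limit_bytes.

    Returns None when memory is flat or shrinking (no leak visible).
    """
    slope = growth_per_iteration(rss_samples)
    if slope <= 0:
        return None
    remaining = limit_bytes - rss_samples[-1]
    return max(0, int(remaining // slope))
```

For the container described above, limit_bytes would be 12 * 1024**3, and rss_samples would be the rss values printed by the script.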