BUG: pymupdf4llm list index out of range in document_layout.py

robvd · November 27, 2025, 10:51am

When parsing the attached file using pymupdf.layout+pymupdf4llm the following traceback is encountered:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/pymupdf4llm/__init__.py", line 97, in to_markdown
    return parsed_doc.to_markdown(
  File "/usr/local/lib/python3.10/site-packages/pymupdf4llm/helpers/document_layout.py", line 672, in to_markdown
    output += list_item_to_md(box.textlines, list_item_levels[i])
  File "/usr/local/lib/python3.10/site-packages/pymupdf4llm/helpers/document_layout.py", line 391, in list_item_to_md
    line = textlines[0]
IndexError: list index out of range

Versions:

pymupdf4llm: 0.2.4

pymupdf-layout: 1.26.6

P.S. I could not attach the file because it was too large. Please tell me how to send it.
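
One way to narrow down which pages trigger the IndexError is to convert the document page by page and record the failures. This is a minimal sketch, not a fix: it assumes the `pages` argument of `pymupdf4llm.to_markdown` accepts a list of 0-based page numbers, and that converting a single page reproduces the same error as converting the whole document (which may not hold). The names `find_failing_pages`, `convert_page` and `scan_pdf` are hypothetical.

```python
from typing import Callable, Dict, List, Tuple


def find_failing_pages(
    convert_page: Callable[[int], str], page_count: int
) -> Tuple[Dict[int, str], List[Tuple[int, str]]]:
    """Convert pages one at a time; return (markdown per page, failures)."""
    converted: Dict[int, str] = {}
    failures: List[Tuple[int, str]] = []
    for pno in range(page_count):
        try:
            converted[pno] = convert_page(pno)
        except IndexError as exc:  # the error type reported in this thread
            failures.append((pno, repr(exc)))
    return converted, failures


def scan_pdf(pdf_name: str):
    """Hypothetical wrapper that plugs the real library calls into the helper."""
    # Imported lazily so find_failing_pages can be used without the library.
    import pymupdf.layout
    import pymupdf4llm

    doc = pymupdf.open(pdf_name)
    # Assumption: pages=[pno] restricts conversion to one 0-based page.
    return find_failing_pages(
        lambda pno: pymupdf4llm.to_markdown(doc, pages=[pno]),
        doc.page_count,
    )
```

Calling `scan_pdf("file.pdf")` would return the Markdown of the pages that converted and a list of `(page number, error)` pairs for those that did not, which can make a bug report more specific when the PDF itself is too large to share.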

Jamie_Lemon · November 28, 2025, 1:15pm

What is the file size of the PDF?

robvd · November 28, 2025, 4:24pm

It is 27.9MB

Jamie_Lemon · November 28, 2025, 5:09pm

Thanks. I have updated the site to allow attachments of up to 30MB, so please try again!

robvd · November 28, 2025, 7:08pm

Here it is!

The commands used were:

import pymupdf.layout
import pymupdf4llm

doc = pymupdf.open(pdf_name)
md_chunks = pymupdf4llm.to_markdown(doc)

file.pdf (26.6 MB)

Jamie_Lemon · November 28, 2025, 11:37pm

@robvd Thanks. The good news is that I can replicate this bug on my machine! Will investigate further next week to see what we can do here …

HaraldLieder · December 2, 2025, 12:23pm

If you upgrade to the latest PyMuPDF4LLM version 0.2.5, everything should work fine. This script:
import sys
from pathlib import Path
import pymupdf.layout
import pymupdf4llm

print(f"{pymupdf.version=}, {pymupdf4llm.version=}")
doc = pymupdf.open(sys.argv[1])
md = pymupdf4llm.to_markdown(
    doc,
    write_images=False,  # do not write image files
    embed_images=False,  # embed images as base64 strings
    image_format="jpg",  # image format (embedded or written)
    header=True,  # include/omit page headers
    footer=False,  # include/omit page footers
    pages=None,
    show_progress=True,
)
Path(doc.name).with_suffix(".md").write_bytes(md.encode())

produces this console output:
[image: screenshot of the console output]

Jamie_Lemon · December 2, 2025, 1:47pm

Can confirm it works for me with the latest version of pymupdf4llm. My script was:

import pymupdf.layout
import pymupdf4llm

print(f"{pymupdf.version=}, {pymupdf4llm.version=}")

doc = pymupdf.open("file.pdf")

try:
  md = pymupdf4llm.to_markdown(
    doc,
    show_progress=True,
  )
  print(md)
except Exception as e:
  print(f'{e=}')
  input('Deliberate hang, press return to continue… ? ')

robvd · December 2, 2025, 3:42pm

I can also confirm it works with version 0.2.5. Thanks!
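
Since the fix arrived in 0.2.5, a script that depends on it could check the installed version before converting. A minimal sketch, assuming the package is installed under the distribution name `pymupdf4llm` and uses plain dotted numeric versions; `parse_version`, `is_at_least` and `check_pymupdf4llm` are hypothetical helper names, not part of the library.

```python
from importlib import metadata
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '0.2.5' into (0, 2, 5), stopping at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break  # ignore suffixes such as 'rc1'
        parts.append(int(piece))
    return tuple(parts)


def is_at_least(installed: str, minimum: str) -> bool:
    """Compare two dotted version strings numerically."""
    return parse_version(installed) >= parse_version(minimum)


def check_pymupdf4llm(minimum: str = "0.2.5") -> bool:
    """Return True if the installed pymupdf4llm is at least `minimum`."""
    # Raises metadata.PackageNotFoundError if the package is not installed.
    installed = metadata.version("pymupdf4llm")
    return is_at_least(installed, minimum)
```

Comparing tuples of integers rather than raw strings avoids ordering mistakes such as treating "0.10.0" as older than "0.2.5".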

robvd · December 2, 2025, 4:06pm

Just one thing: if I convert multiple documents one after another, using
doc = pymupdf.open(pdf_name)
md_chunks = pymupdf4llm.to_markdown(doc)

then the Full-page OCR messages accumulate in the output. For example, after converting the first file the output was:

Full-page OCR on page.number=5/6.
Full-page OCR on page.number=14/15.
Full-page OCR on page.number=23/24.

After the second file the output was:

Full-page OCR on page.number=5/6.
Full-page OCR on page.number=14/15.
Full-page OCR on page.number=23/24.
Full-page OCR on page.number=0/1.
Full-page OCR on page.number=1/2.

So the output for the second file includes the lines from the first file. I have not looked into this in depth yet, I just wanted to mention it.
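
To check whether earlier messages really are printed again on each call, the console output can be captured separately for every document. The sketch below is only a diagnostic aid under stated assumptions: it assumes the messages are written to Python's `sys.stdout` (output written at a lower level or to stderr would not be captured), and `capture_messages`, `repeated_lines` and the `convert` callable are hypothetical names.

```python
import contextlib
import io
from typing import Callable, Dict, Iterable, List


def capture_messages(
    convert: Callable[[str], object], pdf_names: Iterable[str]
) -> Dict[str, List[str]]:
    """Run convert(name) for each file and collect the lines it printed."""
    printed: Dict[str, List[str]] = {}
    for name in pdf_names:
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            convert(name)
        printed[name] = buffer.getvalue().splitlines()
    return printed


def repeated_lines(printed: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """For each file, list the lines already printed for an earlier file."""
    seen = set()
    repeats: Dict[str, List[str]] = {}
    for name, lines in printed.items():
        repeats[name] = [line for line in lines if line in seen]
        seen.update(lines)
    return repeats
```

Here `convert` could be a small function that opens the file with `pymupdf.open` and passes it to `pymupdf4llm.to_markdown`. A repeated line is only a hint, since two files can legitimately need OCR on the same page number.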
