BUG: pymupdf4llm list index out of range in document_layout.py

When parsing the attached file with pymupdf.layout + pymupdf4llm, the following traceback is raised:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/pymupdf4llm/__init__.py", line 97, in to_markdown
    return parsed_doc.to_markdown(
  File "/usr/local/lib/python3.10/site-packages/pymupdf4llm/helpers/document_layout.py", line 672, in to_markdown
    output += list_item_to_md(box.textlines, list_item_levels[i])
  File "/usr/local/lib/python3.10/site-packages/pymupdf4llm/helpers/document_layout.py", line 391, in list_item_to_md
    line = textlines[0]
IndexError: list index out of range
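
The last frame fails on `textlines[0]`, which raises `IndexError` whenever the list is empty. As a minimal sketch of that failure mode and of a defensive guard (the helper name `first_textline` is hypothetical and is not the library's code or its actual fix):

```python
def first_textline(textlines):
    """Return the first text line, or None when the list is empty.

    Hypothetical helper for illustration only; not part of pymupdf4llm.
    """
    if not textlines:  # an empty list would make textlines[0] raise IndexError
        return None
    return textlines[0]


# Indexing an empty list reproduces the error type seen in the traceback.
try:
    [][0]
except IndexError as exc:
    print(f"{exc=}")

# The guarded version returns None instead of raising.
print(first_textline([]))
```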

Versions:

pymupdf4llm: 0.2.4

pymupdf-layout: 1.26.6

P.S. I could not attach the file because it was too large. Please tell me how to send it.

What is the file size of the PDF?

It is 27.9MB

Thanks. I have updated the site to allow attachments of up to 30 MB, so please try again!

Here it is!

The commands used were:

doc=pymupdf.open(pdf_name)
md_chunks = pymupdf4llm.to_markdown(doc)

file.pdf (26.6 MB)

@robvd Thanks. The good news is that I can replicate this bug on my machine! Will investigate further next week to see what we can do here …

If you upgrade to the latest PyMuPDF4LLM version 0.2.5, everything should work fine. This script:
import sys
from pathlib import Path
import pymupdf.layout
import pymupdf4llm

print(f"{pymupdf.version=}, {pymupdf4llm.version=}")
doc = pymupdf.open(sys.argv[1])
md = pymupdf4llm.to_markdown(
    doc,
    write_images=False,  # do not write image files
    embed_images=False,  # embed images as base64 strings
    image_format="jpg",  # image format (embedded or written)
    header=True,  # include/omit page headers
    footer=False,  # include/omit page footers
    pages=None,
    show_progress=True,
)
Path(doc.name).with_suffix(".md").write_bytes(md.encode())

produces this console output:
[image: screenshot of the console output]

Can confirm it works for me with the latest version of pymupdf4llm. My script was:

import pymupdf.layout
import pymupdf4llm

print(f"{pymupdf.version=}, {pymupdf4llm.version=}")

doc = pymupdf.open("file.pdf")

try:
  md = pymupdf4llm.to_markdown(
    doc,
    show_progress=True,
  )
  print(md)
except Exception as e:
  print(f'{e=}')
  input('Deliberate hang, press return to continue… ? ')

I can also confirm it works with version 0.2.5. Thanks!

Just one thing: if I convert multiple documents one after another, using
doc=pymupdf.open(pdf_name)
md_chunks = pymupdf4llm.to_markdown(doc)

then the Full-page OCR messages accumulate in the output. For example, when converting the first file, the output was:

Full-page OCR on page.number=5/6.
Full-page OCR on page.number=14/15.
Full-page OCR on page.number=23/24.


After the second file the output was:

Full-page OCR on page.number=5/6.
Full-page OCR on page.number=14/15.
Full-page OCR on page.number=23/24.
Full-page OCR on page.number=0/1.
Full-page OCR on page.number=1/2.


So it includes the output lines from the first file. I have not dug into this yet; I just wanted to mention it.
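
To check this per document, one option is to capture what each conversion prints and compare the logs. The following is only a sketch: the helper names `capture_output` and `repeated_lines` are hypothetical, and it assumes the messages are written through Python's `sys.stdout` (output written directly by native code would not be captured this way).

```python
import contextlib
import io


def capture_output(convert, source):
    """Run convert(source) and return (result, text printed to stdout)."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        result = convert(source)
    return result, buffer.getvalue()


def repeated_lines(first_log, second_log):
    """Return the lines of second_log that already appeared in first_log."""
    seen = set(first_log.splitlines())
    return [line for line in second_log.splitlines() if line in seen]
```

With this in place, something like `md, log = capture_output(pymupdf4llm.to_markdown, doc)` could be called once per file, and `repeated_lines(first_log, second_log)` would list any messages carried over from the previous document.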