BUG: pymupdf4llm list index out of range in document_layout.py (2)

I ran into another "list index out of range" error. Parsing a large file with pymupdf.layout + pymupdf4llm produces the following traceback:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/pymupdf4llm/__init__.py", line 83, in to_markdown
    parsed_doc = parse_document(
  File "/usr/local/lib/python3.10/site-packages/pymupdf4llm/__init__.py", line 42, in parse_document
    return document_layout.parse_document(
  File "/usr/local/lib/python3.10/site-packages/pymupdf4llm/helpers/document_layout.py", line 908, in parse_document
    utils.clean_tables(page, blocks)
  File "/usr/local/lib/python3.10/site-packages/pymupdf4llm/helpers/utils.py", line 261, in clean_tables
    y_vals = [y_vals0[0]]
IndexError: list index out of range

Versions:

pymupdf4llm: 0.2.5

pymupdf-layout: 1.26.6

The commands used were:

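import pymupdf.layout  # also binds the name pymupdf; imported before pymupdf4llm so layout mode is used
import pymupdf4llm
doc = pymupdf.open(pdf_name)  # pdf_name: path to the PDF file
md_chunks = pymupdf4llm.to_markdown(doc)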
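
Since the file itself cannot be shared, one way to narrow the problem down is to convert the document one page at a time and record which pages raise the IndexError. The following is only a rough sketch: find_failing_pages and convert_page are hypothetical names introduced here, and the commented usage assumes that to_markdown accepts a pages argument with 0-based page numbers.

```python
from typing import Callable, List, Tuple


def find_failing_pages(
    convert_page: Callable[[int], str], page_count: int
) -> Tuple[List[str], List[Tuple[int, str]]]:
    """Convert pages one by one and collect the ones that raise IndexError.

    convert_page is a hypothetical callable that converts a single
    0-based page number to Markdown text.
    """
    chunks: List[str] = []
    failures: List[Tuple[int, str]] = []
    for pno in range(page_count):
        try:
            chunks.append(convert_page(pno))
        except IndexError as exc:
            # Remember the page number and message, then keep going.
            failures.append((pno, str(exc)))
    return chunks, failures


# Possible usage with the real library (assumption: to_markdown takes pages=[...]):
#   chunks, failures = find_failing_pages(
#       lambda pno: pymupdf4llm.to_markdown(doc, pages=[pno]),
#       doc.page_count,
#   )
#   print(failures)
```

The failing page numbers would make it possible to extract just those pages into a much smaller file for a bug report.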

The PDF file is 142 MB, so I cannot upload it here.

P.S. These files are part of the Dutch government's open data and are important to parse. Unfortunately, their quality and size vary greatly. On the other hand, that makes them great test cases :wink:

This problem should have been fixed in pymupdf4llm version 0.2.6.
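
To check whether an installed version already includes the fix, a simple comparison of dotted version strings is enough. This is a minimal sketch with hypothetical helper names (parse_version, has_fix); it only handles plain numeric versions such as 0.2.5 and would fail on suffixes like rc1.

```python
def parse_version(text: str) -> tuple:
    """Turn a plain dotted version like '0.2.6' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def has_fix(installed: str, fixed_in: str = "0.2.6") -> bool:
    """Return True if the installed version is at least the fixed version."""
    return parse_version(installed) >= parse_version(fixed_in)


# The installed version string can be read with the standard library:
#   from importlib.metadata import version
#   print(has_fix(version("pymupdf4llm")))
```

Comparing tuples of integers avoids the string-comparison pitfall where "0.2.10" would sort before "0.2.6".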

@robvd Could you share the open data link to the PDFs? I hope the new PyMuPDF4LLM 0.2.6 resolves your issue. It did resolve the similar issue reported in "BUG: list index out of range using new layout feature" (post #10 by Jamie_Lemon).

It is indeed working with version 0.2.6.

@Jamie_Lemon I stored this file locally at some point because it caused trouble, but unfortunately I did not save the original URL. If you want, I can send the file, e.g. via WeTransfer. Just DM me your email address.
