I stumbled on another "list index out of range" error. When parsing a large file with pymupdf.layout + pymupdf4llm, I get the following traceback:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/pymupdf4llm/__init__.py", line 83, in to_markdown
    parsed_doc = parse_document(
  File "/usr/local/lib/python3.10/site-packages/pymupdf4llm/__init__.py", line 42, in parse_document
    return document_layout.parse_document(
  File "/usr/local/lib/python3.10/site-packages/pymupdf4llm/helpers/document_layout.py", line 908, in parse_document
    utils.clean_tables(page, blocks)
  File "/usr/local/lib/python3.10/site-packages/pymupdf4llm/helpers/utils.py", line 261, in clean_tables
    y_vals = [y_vals0[0]]
IndexError: list index out of range
Versions:
pymupdf4llm: 0.2.5
pymupdf-layout: 1.26.6
The commands used were:
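import pymupdf.layout  # imported before pymupdf4llm so the layout module is used
import pymupdf4llm
doc = pymupdf.open(pdf_name)
md_chunks = pymupdf4llm.to_markdown(doc)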
The size of the PDF file is 142MB so I cannot upload it here.
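Since the full file cannot be shared, one option would be to isolate the failing page(s) into a small reproducer. Below is a minimal sketch, not something I have run on this file. It assumes that `to_markdown` accepts a `pages` list of 0-based page numbers in this version, and that the error also shows up when a page is converted on its own (the traceback points at a per-page `clean_tables` call, so that seems plausible). The helper names `find_failing_pages` and `extract_reproducer` are hypothetical.

```python
from typing import Callable, List


def find_failing_pages(page_count: int, convert_page: Callable[[int], object]) -> List[int]:
    """Return the 0-based page numbers for which convert_page raises IndexError."""
    failing = []
    for pno in range(page_count):
        try:
            convert_page(pno)
        except IndexError:
            # Same exception type as in the traceback above.
            failing.append(pno)
    return failing


def extract_reproducer(pdf_name: str, out_name: str) -> List[int]:
    """Copy the failing pages of pdf_name into a small PDF (hypothetical helper)."""
    # Imported lazily so find_failing_pages itself has no dependencies.
    import pymupdf.layout  # also binds the name `pymupdf`
    import pymupdf4llm

    doc = pymupdf.open(pdf_name)
    # Assumption: `pages` restricts the conversion to the given 0-based pages.
    failing = find_failing_pages(
        doc.page_count,
        lambda pno: pymupdf4llm.to_markdown(doc, pages=[pno]),
    )
    if failing:
        small = pymupdf.open()  # new, empty PDF
        for pno in failing:
            small.insert_pdf(doc, from_page=pno, to_page=pno)
        small.save(out_name)
    return failing
```

Calling `extract_reproducer(pdf_name, "repro.pdf")` would then return the failing page numbers and, if there are any, write only those pages to `repro.pdf`, which should be small enough to attach.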
P.S. These files are part of the Dutch government's open data and are important to parse. Unfortunately, they vary greatly in quality and size. On the other hand, that makes them great test cases.