BUG: list index out of range using new layout feature

I am not allowed to share the document, but maybe the exception helps as well:

task_agent/document_processing/parser/pymupdf_reader.py:71: in to_markdown
    md = pymupdf4llm.to_markdown(
../../../anaconda3/envs/task-agent/lib/python3.13/site-packages/pymupdf4llm/__init__.py:83: in to_markdown
    parsed_doc = parse_document(
../../../anaconda3/envs/task-agent/lib/python3.13/site-packages/pymupdf4llm/__init__.py:42: in parse_document
    return document_layout.parse_document(
../../../anaconda3/envs/task-agent/lib/python3.13/site-packages/pymupdf4llm/helpers/document_layout.py:908: in parse_document
    utils.clean_tables(page, blocks)
../../../anaconda3/envs/task-agent/lib/python3.13/site-packages/pymupdf4llm/helpers/utils.py:261: in clean_tables
    y_vals = [y_vals0[0]]

Versions:
- pymupdf-layout==1.26.6
- pymupdf4llm==0.2.5

The table covers more than one page.

Good to know that your table spans more than one page. I think this might be the same issue as "BUG: pymupdf4llm list index out of range in document_layout.py". Will investigate.

I can see you are on the latest version of pymupdf4llm, so it isn't related to the other issue. It is hard to test without the document. If you set show_progress to True, can you tell how far it gets into the document?
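If show_progress does not make the failing page obvious, another option is to convert one page at a time and record which pages raise. This is only a minimal sketch: find_failing_pages and failing_pages_in_pdf are hypothetical helper names, and it assumes that to_markdown accepts a pages list of 0-based page numbers and that the layout mode honours it. If the error depends on content from neighbouring pages, a page-by-page run may not reproduce it.

```python
from typing import Callable, Dict


def find_failing_pages(convert_page: Callable[[int], str], page_count: int) -> Dict[int, str]:
    """Call convert_page(i) for every page index and collect the errors raised."""
    failures: Dict[int, str] = {}
    for page_number in range(page_count):
        try:
            convert_page(page_number)
        except Exception as exc:  # broad on purpose: we only want to locate the page
            failures[page_number] = f"{type(exc).__name__}: {exc}"
    return failures


def failing_pages_in_pdf(pdf_path: str) -> Dict[int, str]:
    """Hypothetical wrapper that runs pymupdf4llm page by page on a real file."""
    # Imported lazily so find_failing_pages itself has no third-party dependencies.
    import pymupdf
    import pymupdf.layout  # assumption: activates the layout feature, imported before pymupdf4llm
    import pymupdf4llm

    with pymupdf.open(pdf_path) as doc:
        page_count = doc.page_count
    return find_failing_pages(
        # Assumption: pages=[i] restricts the conversion to a single 0-based page.
        lambda i: pymupdf4llm.to_markdown(pdf_path, pages=[i]),
        page_count,
    )
```

Calling failing_pages_in_pdf("your.pdf") returns a dict that maps each failing page index to the exception text, which can be posted without sharing the document itself.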

Just want to bump this up as we are experiencing the issue as well

Hi @qbuchanan, welcome to the forum! Are you able to share your PDF? Also, can you confirm your versions of PyMuPDF Layout and PyMuPDF4LLM? (I'm hoping 1.26.6 and 0.2.5.)
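For anyone who needs to confirm their versions, a small sketch using only the standard library is shown here. installed_versions is a hypothetical helper name; the package names are the ones mentioned in this thread.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("pymupdf", "pymupdf-layout", "pymupdf4llm")):
    """Return {package name: version string, or None if the package is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}: {ver or 'not installed'}")
```

Run it inside the same environment that raises the exception, since a different environment may have different versions installed.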

Hi Jamie

Attached you will find an example that should help to reproduce the issue.

Regards, Marcel

(attachments)

example.pdf (54.5 KB)

PyMuPDF~=1.26.6

pymupdf-layout~=1.26.6

pymupdf4llm~=0.2.5

Thanks Marcel, the document really helps. Will investigate.

@marcelrassinger This should hopefully be fixed for you with the new version of PyMuPDF4LLM, 0.2.6 (pip install pymupdf4llm==0.2.6).
@qbuchanan Perhaps you can try again with the latest version? Basically, there was an error in some of the object classification in the previous version, which caused the issue.

Please let me know how it goes for you and if your issues are resolved!

Hi Jamie,

Bug is fixed, thank you!

However, there seems to be another small glitch.

I call:

md = pymupdf4llm.to_markdown(
    doc="pdf-path",
    write_images=True,
    image_path="my-image-path",
    embed_images=False,
)

After processing, I get images for each page in the image path, but one additional image is also written next to the parsed PDF file:

image.png

It looks like the logo:

many-csv-order-positions.pdf-0001-00.png

This happens in all my test cases. On purpose?

Can I switch it off?

Thanks,
Marcel

Another Bug?:

I call:

md = pymupdf4llm.to_markdown(
    doc='storage/medidor-test.ch/email_data/gmail.com/esid_025d93140ff06a84636eee46426608433dfdd3dec4c3a9c73a9e3a095b127526/Bestellung 94833.pdf',
    write_images=True,
    image_path='storage/medidor-test.ch/parsing_working_dir/esid_025d93140ff06a84636eee46426608433dfdd3dec4c3a9c73a9e3a095b127526/pdf_parser/md_pymupdf4llm_conversion',
    embed_images=False,
)

And I get the following exception:

  File "/Users/mara/Code/agents/task_agent/task_agent/document_processing/parser/pymupdf_reader.py", line 71, in to_markdown
    md = pymupdf4llm.to_markdown(
  File "/Users/mara/anaconda3/envs/task-agent/lib/python3.13/site-packages/pymupdf4llm/__init__.py", line 83, in to_markdown
    parsed_doc = parse_document(
  File "/Users/mara/anaconda3/envs/task-agent/lib/python3.13/site-packages/pymupdf4llm/__init__.py", line 42, in parse_document
    return document_layout.parse_document(
  File "/Users/mara/anaconda3/envs/task-agent/lib/python3.13/site-packages/pymupdf4llm/helpers/document_layout.py", line 963, in parse_document
    pix.save(layoutbox.image)
  File "/Users/mara/anaconda3/envs/task-agent/lib/python3.13/site-packages/pymupdf/__init__.py", line 13894, in save
    return self._writeIMG(filename, idx, jpg_quality)
  File "/Users/mara/anaconda3/envs/task-agent/lib/python3.13/site-packages/pymupdf/__init__.py", line 13573, in _writeIMG
    if format_ == 1: mupdf.fz_save_pixmap_as_png(pm, filename)
  File "/Users/mara/anaconda3/envs/task-agent/lib/python3.13/site-packages/pymupdf/mupdf.py", line 51161, in fz_save_pixmap_as_png
    return _mupdf.fz_save_pixmap_as_png(pixmap, filename)

pymupdf.mupdf.FzErrorSystem: code=2: cannot open file 'storage/medidor-test.ch/parsing_working_dir/esid_025d93140ff06a84636eee46426608433dfdd3dec4c3a9c73a9e3a095b127526/pdf_parser/md_pymupdf4llm_conversion/storage/medidor-test.ch/email_data/gmail.com/esid_025d93140ff06a84636eee46426608433dfd

Somehow the paths get concatenated…

Do I use it incorrectly? But then, why did it work before?

@marcelrassinger I'm unable to replicate your issue. I don't think there is a character length limit for the image_path value. When I ask for images to be extracted, they faithfully go to the folder I define. I don't have your attached zip file, so I didn't try with your "many-csv-order-positions.pdf".

Hi Jamie,

Below is a simple example to reproduce the issue. Running the code results in an exception. It works if you comment out the layout import.

import pymupdf.layout
import pymupdf4llm

md = pymupdf4llm.to_markdown(
    doc="./pdfs/example.pdf",
    write_images=True,
    image_path="./images",
    embed_images=False,
)
print(md)

The folder structure is:

The strange thing is, it also works if I move example.pdf into the same folder as the Python script and set doc="./example.pdf".

I use Python 3.13.5

Regards, Marcel

Exception:

python pymupdf_example.py
Traceback (most recent call last):
  File "/Users/mara/Downloads/test/pymupdf_example.py", line 3, in <module>
    md = pymupdf4llm.to_markdown(
        doc="./pdfs/example.pdf",
        ...<2 lines>...
        embed_images=False,
    )
  File "/Users/mara/anaconda3/envs/task-agent/lib/python3.13/site-packages/pymupdf4llm/__init__.py", line 83, in to_markdown
    parsed_doc = parse_document(
        doc,
        ...<10 lines>...
        use_ocr=use_ocr,
    )
  File "/Users/mara/anaconda3/envs/task-agent/lib/python3.13/site-packages/pymupdf4llm/__init__.py", line 42, in parse_document
    return document_layout.parse_document(
        doc,
        ^^^^
        ...<10 lines>...
        use_ocr=use_ocr,
        ^^^^^^^^^^^^^^^^
    )
    ^
  File "/Users/mara/anaconda3/envs/task-agent/lib/python3.13/site-packages/pymupdf4llm/helpers/document_layout.py", line 963, in parse_document
    pix.save(layoutbox.image)
    ~~~~~~~~^^^^^^^^^^^^^^^^^
  File "/Users/mara/anaconda3/envs/task-agent/lib/python3.13/site-packages/pymupdf/__init__.py", line 13894, in save
    return self._writeIMG(filename, idx, jpg_quality)
           ~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mara/anaconda3/envs/task-agent/lib/python3.13/site-packages/pymupdf/__init__.py", line 13573, in _writeIMG
    if format_ == 1: mupdf.fz_save_pixmap_as_png(pm, filename)
                     ~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^
  File "/Users/mara/anaconda3/envs/task-agent/lib/python3.13/site-packages/pymupdf/mupdf.py", line 51161, in fz_save_pixmap_as_png
    return _mupdf.fz_save_pixmap_as_png(pixmap, filename)
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^
pymupdf.mupdf.FzErrorSystem: code=2: cannot open file './images/./pdfs/example.pdf-0001-00.png': No such file or directory
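The path in the error message is what you would get if the whole doc string, rather than just the file name, were reused as the image file stem. The sketch below only illustrates that guess; it is not taken from the pymupdf4llm source, and both function names are hypothetical. posixpath is used so the result is the same on every platform.

```python
import posixpath


def joined_with_full_path(image_path: str, doc: str, suffix: str = "-0001-00.png") -> str:
    """Assumed faulty behaviour: the complete doc string becomes the file stem."""
    return posixpath.join(image_path, doc + suffix)


def joined_with_basename(image_path: str, doc: str, suffix: str = "-0001-00.png") -> str:
    """Expected behaviour: only the PDF's file name becomes the file stem."""
    return posixpath.join(image_path, posixpath.basename(doc) + suffix)


# PDF in a subfolder: the result points into ./images/pdfs, which does not exist.
print(joined_with_full_path("./images", "./pdfs/example.pdf"))
# Same input, but using only the file name: the result stays inside ./images.
print(joined_with_basename("./images", "./pdfs/example.pdf"))
# PDF next to the script: the extra "./" is harmless, so the save succeeds.
print(joined_with_full_path("./images", "./example.pdf"))
```

This would also explain why moving example.pdf next to the script hides the problem: the extra path component is then just "./", which still resolves to the images folder.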

That appears to have fixed the issue :slight_smile:


It looks like the issue comes from how the pymupdf.layout import affects image-path generation in pymupdf4llm. When the PDF is in a subfolder, the layout code builds an output path like ./images/./pdfs/example.pdf-0001-00.png, but the nested directory ./images/pdfs doesn't exist, which causes the "cannot open file" error. It works when the PDF is in the same folder as the script because the resulting path doesn't contain an extra directory level. Removing the pymupdf.layout import or manually creating the expected subfolders is a temporary workaround until the bug is patched.
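The "create the expected subfolders" workaround could look roughly like this. It is a sketch under the assumption that the image file is written to image_path joined with the relative doc path, as the error message suggests; ensure_nested_image_dir is a hypothetical helper, not part of pymupdf4llm.

```python
from pathlib import Path


def ensure_nested_image_dir(image_path: str, doc: str) -> Path:
    """Create image_path/<parent folders of doc> so the concatenated path can be opened.

    Intended for relative doc paths. With an absolute doc path, pathlib
    discards image_path and the result is simply the PDF's own folder.
    """
    target = Path(image_path) / Path(doc).parent
    target.mkdir(parents=True, exist_ok=True)
    return target


if __name__ == "__main__":
    # For the example in this thread this creates ./images/pdfs
    # before pymupdf4llm.to_markdown(...) is called.
    print(ensure_nested_image_dir("./images", "./pdfs/example.pdf"))
```

Note that this only avoids the exception. The images still end up in the nested folder rather than directly in image_path, so upgrading to a fixed release is the cleaner solution once one is available.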