Bug: pymupdf4llm: mis-interpreted layout and IndexError on specific pages (insurance policy PDF)

Hello,

The announcement on GitHub issue #295 directed me here. I hope this is the right place to report bugs.

Background Context

I am trying to use pymupdf4llm to parse an insurance policy document. This is a typical legal document in Taiwan with the following characteristics:

  • Double-column layout.

  • Complex ordered lists:
    It uses a mixture of

    • Mandarin numerals with punctuation (e.g., 一、, (一)),
    • ASCII numbers, and
    • full-width numbers (e.g., 1, 1-1.)

    as list markers.

  • Hierarchical headings: E.g., “第一章” (Chapter 1), “第一條” (Article 1).

  • Mixed content: The text is often interspersed with tables and images.

The Document

The PDF I am testing is from Tokio Marine Newa Insurance Co., Ltd.:
Download Link (from official website www.tmnewa.com.tw)

The Issues

I encountered two main issues when running pymupdf4llm.to_text:

  1. Table Misidentification (page index 6, zero-based):
    The parser incorrectly identifies the text in the left column as part of a table, messing up the layout structure.
  2. IndexError (page index 7, zero-based):
    Processing this page causes the script to crash with an IndexError: list index out of range.

Reproduction / Traceback

Here is the code snippet and the full traceback:

Environment

  • Ubuntu 24.04.3 LTS x86_64 (in Windows 11 x64 24H2 WSL2)
  • Python 3.12.12
  • pymupdf 1.26.6
  • pymupdf-layout 1.26.6
  • pymupdf4llm 0.2.8 (latest versions installed via uv)

Code Snippet

from pathlib import Path
import pymupdf.layout
import pymupdf4llm

# Assuming the PDF is downloaded to ./docs/insurance.pdf
doc_path = Path('./docs/insurance.pdf')

# This triggers the crash
pymupdf4llm.to_text(doc_path, pages=[7])

# And this is one of the malformed pages.
# The printed text is too long, so I commented it out here
# to prevent cluttering the output.
#
# print(pymupdf4llm.to_text(doc_path, pages=[6]))

Traceback and Malformed Output

OCR disabled because OpenCV not installed.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/path/to/proj/.venv/lib/python3.12/site-packages/pymupdf4llm/__init__.py", line 171, in to_text
    return parsed_doc.to_text(
           ^^^^^^^^^^^^^^^^^^^
  File "/path/to/proj/.venv/lib/python3.12/site-packages/pymupdf4llm/helpers/document_layout.py", line 842, in to_text
    text_string += list_item_to_text(box.textlines, list_item_levels[i])
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/proj/.venv/lib/python3.12/site-packages/pymupdf4llm/helpers/document_layout.py", line 224, in list_item_to_text
    line = textlines[0]
           ~~~~~~~~~^^^
IndexError: list index out of range

Malformed Page (Partial)

OCR disabled because OpenCV not installed.
+-----------------------------------+------+-------+-----------------------+-----+------+---+-------+--------+---------------------+-----+------+
| 故失蹤,自戶籍資料所載失蹤之日起滿一年仍未尋獲,或要保人、受益人能 |      |       |                       |     |      |   | 項目    | 項次     | 失能程度                | 失能等 | 給付   |
| 提出證明文件足以認為被保險人極可能因本保險契約所約定之意外傷害事  |      |       |                       |     |      |   |       |        |                     | 級   | 比例   |
| 故而死亡者,本公司按第九十二條約定先行給付身故保險金或喪葬費用保險 |      |       |                       |     |      |   |       |        |                     |     |      |
| 金。但日後發現被保險人生還時,要保人或受益人應將該筆已領之退還已繳 |      |       |                       |     |      |   |       |        |                     |     |      |

Any advice or fixes would be greatly appreciated.
Thank you.

Interesting - it seems this only happens when using the to_text method. to_markdown and to_json appear to work without crashing.

Are you able to use one of the other methods to get the text you need? Note, like you, I see page index 6 mistakenly coming back as a table format - I’m not sure how to resolve that without perhaps training the Layout module more.

@HaraldLieder Any ideas why to_text causes the index out of bounds problem on page index 7 but the other methods don’t?

Well, it must be an error in that function. Plain-text and MD output often differ too much from each other to be handled by the same output functions, so the two code paths can behave somewhat differently …

I’ve had a more detailed look at that file in the meantime:

  • Page 7 (6 zero-based) is completely messed up by the layout plugin PyMuPDF-Layout. There is currently no way of healing this. The other pages don’t seem to exhibit major problems.
  • The index error is a trivial bug in pymupdf4llm itself: In some cases, layout bboxes like “list-item” unexpectedly contain no text at all (and such cases are irregular of course). We simply need to check this condition and return an empty text string. That check wasn’t there in the reported case (but was present for MD output).

I will include the page with the severe problem (page 7) in the issue list of PyMuPDF-Layout.
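
The missing check described above could look roughly like the following. This is a minimal sketch, not the actual pymupdf4llm code: the function name, the use of plain strings for text lines, and the indentation scheme are all assumptions made for illustration.

```python
def list_item_to_text_guarded(textlines: list[str], level: int) -> str:
    """Hypothetical stand-in for a list-item renderer with an empty-box guard."""
    # Guard: a "list-item" layout box may unexpectedly contain no text.
    # Returning an empty string avoids indexing textlines[0] on an empty list.
    if not textlines:
        return ""

    # Assumed formatting: two spaces of indentation per nesting level,
    # continuation lines indented two further spaces.
    indent = "  " * max(level - 1, 0)
    lines = [indent + textlines[0]]
    lines.extend(indent + "  " + text for text in textlines[1:])
    return "\n".join(lines) + "\n"
```

With this guard, an empty box contributes nothing to the output instead of raising IndexError.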

Thank you, Jamie_Lemon and HaraldLieder, for the prompt replies and the explanation!

I have tested to_markdown as suggested, and I can confirm it successfully processes page 7 without crashing. For now, I will switch to to_markdown in my current workflow to avoid the IndexError.
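
For anyone who wants to keep plain-text output where it works, one possible workaround is to fall back to the Markdown path only when the crash occurs. The helper below is a sketch under assumptions: extract_with_fallback is a made-up name, and it is written against generic callables so it does not depend on any particular pymupdf4llm signature.

```python
from typing import Callable


def extract_with_fallback(
    primary: Callable[..., str],
    fallback: Callable[..., str],
    *args,
    **kwargs,
) -> str:
    """Call primary; if it raises IndexError, call fallback with the same arguments."""
    try:
        return primary(*args, **kwargs)
    except IndexError:
        # Only the error reported in this thread is caught; anything else propagates.
        return fallback(*args, **kwargs)


# Intended usage (not executed here, requires pymupdf4llm and the PDF):
#   text = extract_with_fallback(
#       pymupdf4llm.to_text, pymupdf4llm.to_markdown, doc_path, pages=[7]
#   )
```

Note that the fallback returns Markdown rather than plain text, so downstream code has to tolerate both formats.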

Regarding the layout issue on page 6 (where the column is mistaken for a table), I understand that layout conventions common in East Asia might differ significantly from the typical training data, making them tricky for layout analysis models.

Harald mentioned adding page 7 to the issue list. I was wondering if page index 6 might also be a useful test case or training data regarding these specific layout conventions? If so, please feel free to add it to the list for future reference as well.

Thanks again for the quick reply!

Just to confirm - @HaraldLieder has added page index 6 (page 7) to the issue list. @yisiang.fu When you say “Regarding the layout issue on page 6 (where the column is mistaken for a table)” I think you mean “Regarding the layout issue on page index 6 (where the column is mistaken for a table)”.

I wish the page references in the API weren’t zero-indexed sometimes!
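
To sidestep the page number versus page index mix-up, a small conversion helper can make the intent explicit. This is only an illustrative sketch; page_numbers_to_indices is a made-up name and not part of any PyMuPDF API.

```python
def page_numbers_to_indices(page_numbers: list[int]) -> list[int]:
    """Convert 1-based page numbers (as shown in a PDF viewer) to 0-based indices."""
    indices = []
    for number in page_numbers:
        if number < 1:
            raise ValueError(f"page numbers start at 1, got {number}")
        indices.append(number - 1)
    return indices


# Pages 7 and 8 in a viewer correspond to page indices 6 and 7.
print(page_numbers_to_indices([7, 8]))  # [6, 7]
```

The result can then be passed as the pages argument used in the snippet earlier in this thread.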