Hello,
The announcement on GitHub issue #295 directed me here. I hope this is the right place to report bugs.
Background Context
I am trying to use pymupdf4llm to parse an insurance policy document. This is a typical legal document in Taiwan with the following characteristics:
- Double-column layout.
- Complex ordered lists: it uses a mixture of the following as list markers:
  - Mandarin numerals with punctuation (e.g., 一、, (一)),
  - ASCII numbers, and
  - full-width numbers (e.g., 1, 1-1.).
- Hierarchical headings: e.g., “第一章” (Chapter 1), “第一條” (Article 1).
- Mixed content: the text is often interspersed with tables and images.
The Document
The PDF I am testing is from Tokio Marine Newa Insurance Co., Ltd.:
Download Link (from official website www.tmnewa.com.tw)
The Issues
I encountered two main issues when running pymupdf4llm.to_text:
- Table Misidentification (Page 6): The parser incorrectly identifies the text in the left column as part of a table, messing up the layout structure.
- IndexError (Page 7): Processing page 7 causes the script to crash with an IndexError: list index out of range.
Reproduction / Traceback
Here is the code snippet and the full traceback:
Environment
- Ubuntu 24.04.3 LTS x86_64 (in Windows 11 x64 24H2 WSL2)
- Python 3.12.12
- pymupdf 1.26.6, pymupdf-layout 1.26.6, pymupdf4llm 0.2.8 (latest versions installed via uv)
Code Snippet
from pathlib import Path
import pymupdf.layout
import pymupdf4llm
# Assuming the PDF is downloaded to ./docs/insurance.pdf
doc_path = Path('./docs/insurance.pdf')
# This triggers the crash
pymupdf4llm.to_text(doc_path, pages=[7])
# And this is one of the malformed pages.
# The printed text is too long, so I commented it out here
# to prevent cluttering the output.
#
# print(pymupdf4llm.to_text(doc_path, pages=[6]))
Traceback and Malformed Output
OCR disabled because OpenCV not installed.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/path/to/proj/.venv/lib/python3.12/site-packages/pymupdf4llm/__init__.py", line 171, in to_text
    return parsed_doc.to_text(
           ^^^^^^^^^^^^^^^^^^^
  File "/path/to/proj/.venv/lib/python3.12/site-packages/pymupdf4llm/helpers/document_layout.py", line 842, in to_text
    text_string += list_item_to_text(box.textlines, list_item_levels[i])
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/proj/.venv/lib/python3.12/site-packages/pymupdf4llm/helpers/document_layout.py", line 224, in list_item_to_text
    line = textlines[0]
           ~~~~~~~~~^^^
IndexError: list index out of range
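In case it helps with triage, the same crash could be checked for on the other pages with a small loop like the one below. This is only a rough sketch: `find_failing_pages` and `parse_page` are hypothetical names made up for illustration, not part of pymupdf4llm, and the only library call assumed is the `pymupdf4llm.to_text(doc_path, pages=[n])` form already used in the snippet above.

```python
from typing import Callable, Dict, Iterable


def find_failing_pages(
    parse_page: Callable[[int], str],
    page_numbers: Iterable[int],
) -> Dict[int, str]:
    """Call parse_page on each page number and record any exception.

    Returns a dict mapping page number -> "ExceptionName: message"
    for every page whose parse raised.
    """
    failures: Dict[int, str] = {}
    for page_number in page_numbers:
        try:
            parse_page(page_number)
        except Exception as exc:  # broad on purpose: log the error and keep going
            failures[page_number] = f"{type(exc).__name__}: {exc}"
    return failures


# Intended use with the snippet above (not executed here, it needs the PDF):
#   failures = find_failing_pages(
#       lambda n: pymupdf4llm.to_text(doc_path, pages=[n]),
#       range(page_count),  # page_count is hypothetical: number of pages in the PDF
#   )
#   print(failures)
```

Passing the parser in as a callable keeps the loop independent of the PDF, so it can be tried with a stand-in function before running it on the real document.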
Malformed Page (Partial)
OCR disabled because OpenCV not installed.
+-----------------------------------+------+-------+-----------------------+-----+------+---+-------+--------+---------------------+-----+------+
| 故失蹤,自戶籍資料所載失蹤之日起滿一年仍未尋獲,或要保人、受益人能 | | | | | | | 項目 | 項次 | 失能程度 | 失能等 | 給付 |
| 提出證明文件足以認為被保險人極可能因本保險契約所約定之意外傷害事 | | | | | | | | | | 級 | 比例 |
| 故而死亡者,本公司按第九十二條約定先行給付身故保險金或喪葬費用保險 | | | | | | | | | | | |
| 金。但日後發現被保險人生還時,要保人或受益人應將該筆已領之退還已繳 | | | | | | | | | | | |
Any advice or fixes would be greatly appreciated.
Thank you.