fitz.Page.find_tables() misses last column when using horizontal_strategy="text"

Ana_Guedes · July 17, 2025, 9:53am

Hi! I’m using fitz.Page.find_tables() with horizontal_strategy="text" and vertical_strategy="lines" while explicitly passing vertical_lines. However, the last column is consistently missing from the extracted table, even though the corresponding text spans are properly placed and inside the search area.

Here’s the relevant code:

table_finder = page.find_tables(
    clip=table_content_rect,
    vertical_lines=columns_positions,
    vertical_strategy="lines",
    horizontal_strategy="text",
)

What I’ve already verified:

The text of the last column is clearly within doc_content_rect (red boundaries)
The column positions are correctly passed via vertical_lines. (green vertical lines)
I’ve tried increasing x_tolerance and snap_x_tolerance, reducing min_words_horizontal, and expanding the clip to the full page width (pymupdf_issue_output_2.png), but the last column is still not detected.
If I switch to horizontal_strategy="lines", the last column does appear — but row grouping becomes unreliable.

The content of the full pdf is sensitive but I attached two adapted images showing:

The red rectangle: the clip area used for detection.
The green vertical lines: the passed columns_positions.
The blue rectangle: the bbox corresponding to the rendered table, missing the last column despite the text and lines being correctly positioned.

Would appreciate any help in how to fix this. Thank you!

Jamie_Lemon · July 17, 2025, 1:46pm

This looks like a fairly standard grid layout for the table, so I’m surprised the last column isn’t being picked up. Is there any way you can take one page of the PDF, redact the info, and attach it here? This is difficult to figure out without the source PDF page!

Ana_Guedes · July 17, 2025, 3:52pm

Thank you. This is a version of the PDF where I got the same results.
The coordinates of the clip rectangle within which the table is searched are: Rect(0, 116.22047244094489, 595.28, 805.0396062992127) and the column positions used to get vertical lines are:
Column positions: [56.69291338582678, 109.90771653543308, 176.77700787401577, 243.64629921259845, 310.51559055118116, 377.38488188976385, 444.25417322834653, 511.1234645669292, 581.1067716535433]
20250717-table-text 1.pdf (105.4 KB)

Jamie_Lemon · July 17, 2025, 4:07pm

Thanks - taking a look at this …

Jamie_Lemon · July 17, 2025, 4:48pm

@Ana_Guedes I’m not going to pretend to understand this 100%, but if I do this:

results = page.find_tables(vertical_strategy="lines", horizontal_strategy="text", vertical_lines=[560,595])

By defining just the last column’s vertical lines as being important, it seems to deliver the results for 7 columns as expected.

Seems like this also works:
results = page.find_tables(vertical_strategy="lines", horizontal_strategy="text", vertical_lines=[560])

So in this case I guess we are encouraging PyMuPDF to really pay attention to just the start of the last column. That doesn’t really explain why it seemingly “gives up” before then, though.

Hope this works for you!

Ana_Guedes · July 17, 2025, 5:27pm

Thank you for the help! If I use vertical_lines=[560] it does indeed work. However, how do you calculate that position? I can see it is where the text in the last column’s cells ends, but I need this to work dynamically since the text inside each cell might change (column positions will always be the same, but the content varies). I tried vertical_lines=[511] and vertical_lines=[581] (the last column’s boundaries) and vertical_lines=[585] (right edge of the PDF), and none of them work.

Thanks again

Jamie_Lemon · July 17, 2025, 5:36pm

I just guessed the position of the last column - not ideal I know!

Like I say, I don’t understand it 100%: if I use a number between 560 and 564 it works; anything else seems to crop the content.

So basically there was no “calculation” on my part - I need to take a further look at the PDF to see if there is anything beyond “heuristics” here to figure it out.

Having said all this, more importantly, why should PyMuPDF seemingly ignore this last column? Feels a bit weird to me! @HaraldLieder

HaraldLieder · July 18, 2025, 4:49pm

Hi Ana,
welcome here!
Here is a working script. The approach is based on the following insights:

The parameters vertical_lines / horizontal_lines are always somewhat problematic, because they cause the Finder to cease its own effort completely (see the documentation). It is almost always better to add any known information (lines and rectangles) via the parameters add_lines / add_boxes.
Never worry about adding potentially redundant information: it will be swallowed by the Finder.
This case has a few complications:
- Text under the gray left rectangle has no line separators, so the finder assumes this is one single, multi-line cell. We address this by making the existing horizontal line drawings longer.
- Those weird “|” things are not characters, but very short vertical lines. Once that is understood, we can extend them to the top and bottom too.
The good thing about this case is that the table bbox is correctly determined regardless. So we make two rounds of table finding: one to get the table bbox, and another one with the now enriched gridline information.
test.py (904 Bytes)
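
The attached script is not reproduced in this thread. As a rough illustration of the two-round idea only, here is a minimal sketch that assumes the column x positions are already known (as in Ana's case) and feeds them in through add_lines instead of vertical_lines. The helper names column_lines and find_tables_two_rounds are hypothetical, and taking the first detected table's bbox in round one is an assumption; the actual test.py extends the existing line drawings of the page, which this sketch does not do.

```python
def column_lines(xs, y0, y1):
    """Build one vertical "virtual" line per x position, spanning y0..y1.

    Each line is a pair of (x, y) points, the shape add_lines expects.
    """
    return [((x, y0), (x, y1)) for x in xs]


def find_tables_two_rounds(page, xs, **kwargs):
    """Hypothetical helper: detect the table bbox, then re-run with extra lines.

    `page` is a PyMuPDF Page; `xs` are the known column boundaries.
    Extra keyword arguments (e.g. clip=...) are passed to both rounds.
    """
    # Round 1: let the finder work unaided, only to learn the table bbox.
    first = page.find_tables(**kwargs)
    if not first.tables:
        return first  # nothing found, nothing to enrich

    # Assumption: the first detected table is the one we care about.
    _, y0, _, y1 = first.tables[0].bbox

    # Round 2: same search, enriched with the known column boundaries.
    # add_lines supplements the finder's own detection rather than replacing it.
    return page.find_tables(add_lines=column_lines(xs, y0, y1), **kwargs)
```

With Ana's data this might be called as find_tables_two_rounds(page, columns_positions, clip=table_content_rect); whether that alone recovers the last column on her PDF has not been verified here.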

Hope it helps!

Ana_Guedes · July 21, 2025, 10:01am

Hello Harald,

That script worked perfectly, thank you a lot for that and the clear explanation!

Ana_Guedes · July 21, 2025, 10:03am

All is working now with the strategy @HaraldLieder provided. Thank you a lot for taking the time to look this up and trying to help.

HaraldLieder · July 21, 2025, 10:17am

Glad I could help! Enjoy PyMuPDF!
