fitz.Page.find_tables() misses last column when using horizontal_strategy="text"

Hi! I’m using fitz.Page.find_tables() with horizontal_strategy="text" and vertical_strategy="lines" while explicitly passing vertical_lines. However, the last column is consistently missing from the extracted table, even though the corresponding text spans are properly placed and inside the search area.

Here’s the relevant code:

table_finder = page.find_tables(
    clip=table_content_rect,
    vertical_lines=columns_positions,
    vertical_strategy="lines",
    horizontal_strategy="text",
)

What I’ve already verified:

  • The text of the last column is clearly within doc_content_rect (red boundaries)
  • The column positions are correctly passed via vertical_lines (green vertical lines).
  • I’ve tried increasing x_tolerance and snap_x_tolerance, reducing min_words_horizontal, and expanding the clip to the full page width (pymupdf_issue_output_2.png), but the last column is still not detected.
  • If I switch to horizontal_strategy="lines", the last column does appear — but row grouping becomes unreliable.
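
For anyone who wants to repeat the first two checks programmatically, here is a minimal sketch. It assumes the word tuples come from page.get_text("words", clip=...), whose first four items are x0, y0, x1, y1. The helper names column_index and words_per_column are made up for this illustration and are not part of PyMuPDF.

```python
from bisect import bisect_right


def column_index(x0, x1, column_positions):
    """Return the index of the column holding the word's horizontal midpoint.

    column_positions must be sorted; N positions delimit N - 1 columns.
    Returns None when the midpoint lies outside the outermost positions.
    """
    mid = (x0 + x1) / 2
    i = bisect_right(column_positions, mid) - 1
    if i < 0 or i >= len(column_positions) - 1:
        return None
    return i


def words_per_column(words, column_positions):
    """Count the words that fall into each column (hypothetical helper)."""
    counts = [0] * (len(column_positions) - 1)
    for x0, _y0, x1, _y1, *_rest in words:
        i = column_index(x0, x1, column_positions)
        if i is not None:
            counts[i] += 1
    return counts


# Usage with PyMuPDF (not executed here):
#   words = page.get_text("words", clip=table_content_rect)
#   print(words_per_column(words, columns_positions))
```

A non-zero count in the last slot would confirm that the text of the last column sits between the last two vertical lines, so the problem is in the table detection and not in the coordinates.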

The content of the full pdf is sensitive but I attached two adapted images showing:

  • The red rectangle: the clip area used for detection.
  • The green vertical lines: the passed columns_positions.
  • The blue rectangle: the bbox corresponding to the rendered table, missing the last column despite the text and lines being correctly positioned.

Would appreciate any help in how to fix this. Thank you!


This looks like a fairly standard grid layout for the table, so I’m surprised the last column isn’t being picked up. Is there any way you can take one page of the PDF, redact the info, and attach it here? This is difficult to figure out without the source PDF page!

Thank you. This is a version of the PDF where I got the same results.
The coordinates of the clip rectangle within which the table is searched are: Rect(0, 116.22047244094489, 595.28, 805.0396062992127) and the column positions used to get vertical lines are:
Column positions: [56.69291338582678, 109.90771653543308, 176.77700787401577, 243.64629921259845, 310.51559055118116, 377.38488188976385, 444.25417322834653, 511.1234645669292, 581.1067716535433]
20250717-table-text 1.pdf (105.4 KB)

Thanks - taking a look at this …


@Ana_Guedes I’m not going to pretend to understand this 100%, but if I do this:

results = page.find_tables(vertical_strategy="lines", horizontal_strategy="text", vertical_lines=[560,595])

By defining just the last vertical column as being important then it seems to deliver the results for 7 columns as expected.

Seems like this also works:
results = page.find_tables(vertical_strategy="lines", horizontal_strategy="text", vertical_lines=[560])

So in this case I guess we are encouraging PyMuPDF to really pay attention to just the start of the last column. Doesn’t really explain why it seemingly “gives up” before then though :slight_smile:

Hope this works for you!

Thank you for the help! If I use vertical_lines=[560], it does indeed work. However, how do you calculate that position? I can see it is where the text from the last column’s cells ends, but I need this to work dynamically since the text inside each cell might change (column positions will always be the same, but the content varies). I tried vertical_lines=[511] and vertical_lines=[581] (last column boundaries) and vertical_lines=[585] (right edge of the PDF), and none of them work.

Thanks again :slight_smile:

I just guessed the position of the last column - not ideal I know!

Like I say, I don’t understand it 100%. If I use a number between 560 and 564 it works; anything else seems to crop the content.

So basically there was no “calculation” on my part - I need to take a further look at the PDF to see if there is anything beyond “heuristics” here to figure it out.

Having said all this, more importantly, why should PyMuPDF seemingly ignore this last column? Feels a bit weird to me! @HaraldLieder

Hi Ana,
welcome here!
Here is a working script. The approach is based on the following insights:

  1. The parameters vertical_lines / horizontal_lines are always somewhat problematic, because they cause the Finder to cease its own effort completely (see the documentation). It is almost always better to add any known information (lines and rectangles) via the parameters add_lines / add_boxes.
  2. Never worry about adding potentially redundant information: it will be swallowed by the Finder.
  3. This case has a few complications:
    • Text under the gray left rectangle has no line separators, so the finder assumes this is one single, multi-line cell. We address this by making the existing horizontal line drawings longer.
    • Those weird “|” things are not characters, but vertical lines that are too short :smirking_face:. Once that is understood, we can extend them to the top and bottom too.
  4. The good thing about this case is that the table bbox is correctly determined regardless. So we make two rounds of table finding: one for getting the table bbox, and another one with the now enriched gridline information.
    test.py (904 Bytes)
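
For readers without access to the attachment, the two-round idea can be sketched roughly as follows. This is not the attached test.py, only a minimal illustration under assumptions: round one runs with the clip and default strategies, the first table found is the one of interest, and the known column positions are supplied as virtual lines through add_lines. The helper names vertical_segments and find_tables_two_rounds are hypothetical; find_tables, its add_lines parameter, TableFinder.tables, and Table.bbox are PyMuPDF's.

```python
def vertical_segments(xs, bbox):
    """Build vertical line segments at each x, spanning the bbox height.

    Each segment is a pair of (x, y) points, the shape add_lines accepts.
    """
    _x0, y0, _x1, y1 = bbox
    return [((x, y0), (x, y1)) for x in xs]


def find_tables_two_rounds(page, column_positions, clip=None):
    """Hypothetical helper: locate the table first, then re-run with extra lines."""
    # Round 1: only used to obtain the table bbox.
    first = page.find_tables(clip=clip)
    if not first.tables:
        return first

    # Assumption: the first detected table is the one we care about.
    bbox = tuple(first.tables[0].bbox)

    # Round 2: same search area, enriched with virtual gridlines.
    # add_lines supplements the Finder's own detection instead of replacing it.
    extra = vertical_segments(column_positions, bbox)
    return page.find_tables(clip=clip, add_lines=extra)


# Usage with PyMuPDF (not executed here):
#   finder = find_tables_two_rounds(page, columns_positions, clip=table_content_rect)
#   print(finder.tables[0].extract())
```

Extending the existing short drawings, as described in point 3, would follow the same pattern: collect their coordinates, stretch them to the bbox edges, and append them to the list passed as add_lines.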

Hope it helps!


Hello Harald,

That script worked perfectly, thank you a lot for that and the clear explanation!

All is working now with the strategy @HaraldLieder provided. Thank you a lot for taking the time to look this up and trying to help.

Glad I could help! Enjoy PyMuPDF!
