I am using the PyMuPDF library to extract tables from PDF files. For this I use page.find_tables(), which works really well in my case.
A small number of tables have a diagonal textual watermark in the background. When I extract the text from the individual table cells, some of them contain the part of the watermark that lies within the cell boundaries as noise.
I can detect the watermark by calling page.get_text("dict") and looking at the "dir" field of each line in the dictionary. If it contains a value pair other than (1.0, 0.0) or (0.0, -1.0), I know it's a watermark.
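The check described above could be sketched roughly as follows. This is only a minimal sketch: the helper names is_watermark_dir and watermark_lines and the tolerance value are made up for illustration, and the functions work on the plain dictionary returned by page.get_text("dict"), so they can be tried without opening a PDF.

```python
import math

# Directions treated as ordinary text, taken from the description above.
NORMAL_DIRS = ((1.0, 0.0), (0.0, -1.0))


def is_watermark_dir(direction, tol=1e-3):
    """Return True if a line's "dir" matches none of the normal directions.

    The tolerance is an assumption to absorb floating point rounding.
    """
    return not any(
        math.isclose(direction[0], nx, abs_tol=tol)
        and math.isclose(direction[1], ny, abs_tol=tol)
        for nx, ny in NORMAL_DIRS
    )


def watermark_lines(text_dict):
    """Collect (bbox, text) for every rotated line in a "dict" result."""
    found = []
    for block in text_dict.get("blocks", []):
        # Image blocks carry no "lines" key, so they are skipped here.
        for line in block.get("lines", []):
            if is_watermark_dir(line["dir"]):
                text = "".join(span["text"] for span in line["spans"])
                found.append((tuple(line["bbox"]), text))
    return found
```

With a real page this would be called as watermark_lines(page.get_text("dict")).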
What I would like to do now is delete the watermark text, or overwrite it with an empty string, before extracting the table with page.find_tables().
I have seen that a PDF page can be altered using the information in the xref table, but so far I have not managed to code this correctly.
Can anybody give me a hint or a code snippet showing how to identify the watermark text in the xref table and how to write the change back to the PDF page?
Every comment is welcome even if it were “This can’t be done”.
If the watermark is “normal” text, things are bound to get difficult. But before I say “forget it”, I’d like to have a look at an example.
There still exists the chance to come across lucky cases. For instance, the pesky text might be contained in its own, identifiable PDF object, in which case it could be removed, etc.
I extract the table with page.find_tables() and then read the text of every table cell with page.get_text("text", clip=bbox). The second cell in row 3 then gives me "Toyota\nSome street\nSome town\nw" because the first letter of "watermark" lies within the cell boundaries. This is the noise I would like to avoid.
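The per-cell reading described above could look roughly like the sketch below. It includes one assumed variation that is not my current code: instead of "text" it requests "dict" for the clipped cell and drops lines with an unusual "dir", which would avoid changing the PDF at all. The function names are made up for illustration, and page is expected to be a PyMuPDF page object.

```python
# Directions treated as ordinary text, taken from the earlier description.
NORMAL_DIRS = {(1.0, 0.0), (0.0, -1.0)}


def clean_cell_text(page, bbox):
    """Text inside bbox, keeping only lines written in a normal direction."""
    kept = []
    cell = page.get_text("dict", clip=bbox)
    for block in cell.get("blocks", []):
        for line in block.get("lines", []):  # image blocks have no "lines"
            # Rounding is an assumption to absorb floating point noise.
            direction = tuple(round(v, 3) for v in line["dir"])
            if direction in NORMAL_DIRS:
                kept.append("".join(span["text"] for span in line["spans"]))
    return "\n".join(kept)


def table_cell_texts(page):
    """Yield (table, row, column, text) for every cell that has a box."""
    for t, table in enumerate(page.find_tables().tables):
        for r, row in enumerate(table.rows):
            for c, bbox in enumerate(row.cells):
                if bbox is not None:  # some cells may have no rectangle
                    yield t, r, c, clean_cell_text(page, bbox)
```

For the cell mentioned above, the rotated "w" would be dropped and only the three address lines would remain, provided the watermark line really reports a different "dir" inside the clip.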
When I call page.get_text("dict") I get a result similar to this:
So my idea was to read the PDF page in some form, overwrite the text "watermark" with "", write the page back, and only then read the table with page.find_tables().
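One way this idea might be approximated is with redaction annotations rather than by editing the xref content directly. The sketch below is an assumption on my part, not a confirmed solution: it marks every character of a rotated line via its "rawdict" bounding box and applies the redactions in memory. The helper names are invented, the graphics argument of apply_redactions is assumed to exist in the installed PyMuPDF version, and ordinary characters that overlap a watermark character's box may be removed as well.

```python
# Directions treated as ordinary text, taken from the earlier description.
NORMAL_DIRS = {(1.0, 0.0), (0.0, -1.0)}


def rotated_char_boxes(raw_dict):
    """Bounding boxes of all characters on lines with an unusual direction."""
    boxes = []
    for block in raw_dict.get("blocks", []):
        for line in block.get("lines", []):  # image blocks have no "lines"
            direction = tuple(round(v, 3) for v in line["dir"])
            if direction in NORMAL_DIRS:
                continue
            for span in line["spans"]:
                boxes.extend(tuple(ch["bbox"]) for ch in span["chars"])
    return boxes


def remove_rotated_text(page):
    """Redact rotated characters on the page; return how many were marked."""
    boxes = rotated_char_boxes(page.get_text("rawdict"))
    for box in boxes:
        page.add_redact_annot(box)
    if boxes:
        # images=0 and graphics=0 ask PyMuPDF to leave images and vector
        # graphics (such as table borders) untouched.
        page.apply_redactions(images=0, graphics=0)
    return len(boxes)
```

The change happens on the in-memory page, so page.find_tables() could be called right after remove_rotated_text(page) without saving the file first.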