Removing watermark text

Bernd · January 24, 2026, 2:33pm

Hello,

I am using the PyMuPDF library to extract tables from PDF files. For this I use page.find_tables(), which works really well in my case.

A small number of tables have a diagonal text watermark in the background. When I extract the text from the individual table cells, some of them contain the part of the watermark that lies within the cell boundaries as noise.

I can detect the watermark by calling page.get_text("dict") and looking at the "dir" field of each line in the dictionary. If it contains a value pair other than (1.0, 0.0) or (0.0, -1.0), I know it’s a watermark.
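The check described above could look roughly like the sketch below. It works on the plain dictionary returned by page.get_text("dict"), so a hand-made sample stands in for a real page here. The helper names `is_watermark_line` and `watermark_lines`, the tolerance value and the sample data are illustrative assumptions, not part of PyMuPDF.

```python
import math

# Directions treated as regular text: (1, 0) is plain left-to-right text,
# (0, -1) is text rotated by 90 degrees.
REGULAR_DIRS = ((1.0, 0.0), (0.0, -1.0))


def is_watermark_line(line, tol=1e-3):
    """Return True if the line's writing direction is not a regular one."""
    dx, dy = line["dir"]
    return not any(
        math.isclose(dx, rx, abs_tol=tol) and math.isclose(dy, ry, abs_tol=tol)
        for rx, ry in REGULAR_DIRS
    )


def watermark_lines(page_dict):
    """Collect all lines with an unusual direction from a "dict" result."""
    return [
        line
        for block in page_dict.get("blocks", [])
        for line in block.get("lines", [])  # image blocks have no "lines"
        if is_watermark_line(line)
    ]


# Hand-made stand-in for page.get_text("dict") (assumed sample data).
sample = {
    "blocks": [
        {"type": 0, "lines": [
            {"dir": (1.0, 0.0), "spans": [{"text": "Toyota"}]},
            {"dir": (-0.866, -0.5), "spans": [{"text": "watermark"}]},
        ]},
        {"type": 1},  # an image block
    ]
}

for line in watermark_lines(sample):
    print("".join(span["text"] for span in line["spans"]))
```

On a real page the call would be watermark_lines(page.get_text("dict")).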

What I would like to do now is to delete the watermark text or to override it with an empty string before extracting the table with page.find_tables().

I have seen that a PDF page can be altered via the information in the xref table, but so far I have not managed to code this correctly.

Can anybody give me a hint or a code snippet showing how to identify the watermark text in the xref table and how to write the change back to the PDF page?

Any comment is welcome, even if it is “This can’t be done”.

HaraldLieder · January 26, 2026, 12:15pm

If the watermark is “normal” text, things are bound to get difficult. But before I say “forget it”, I’d like to have a look at an example.
There is still a chance of coming across a lucky case. For instance, the pesky text might be contained in its own, identifiable PDF object, in which case it could be removed, etc.
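A rough sketch of one such lucky case, not tested against any real file: some PDF producers wrap a watermark in a marked-content section tagged as a watermark artifact. If that is what the page contains, the section can be cut out of the content stream as plain bytes. The function name `strip_watermark_sections`, the regular expression and the sample stream are assumptions made for illustration, and the pattern does not handle nested BDC/EMC pairs.

```python
import re

# Matches a marked-content section tagged as a watermark artifact, e.g.
#   /Artifact <</Subtype /Watermark /Type /Pagination>> BDC ... EMC
# Assumption: the section contains no nested BDC/EMC pairs.
WATERMARK_SECTION = re.compile(
    rb"/Artifact\s*<<[^>]*/Watermark[^>]*>>\s*BDC.*?EMC\s*",
    re.DOTALL,
)


def strip_watermark_sections(stream: bytes) -> bytes:
    """Return the content stream without watermark artifact sections."""
    return WATERMARK_SECTION.sub(b"", stream)


# Hand-made content stream (assumed sample data).
sample = (
    b"BT /F1 12 Tf 72 700 Td (Toyota) Tj ET\n"
    b"/Artifact <</Subtype /Watermark /Type /Pagination>> BDC\n"
    b"q 0.866 0.5 -0.5 0.866 100 100 cm /Fm0 Do Q\n"
    b"EMC\n"
    b"BT /F1 12 Tf 72 680 Td (Some street) Tj ET\n"
)
cleaned = strip_watermark_sections(sample)
print(cleaned.decode())
```

With PyMuPDF, the stream of each xref in page.get_contents() can be read with doc.xref_stream(xref) and written back with doc.update_stream(xref, cleaned). If the watermark is ordinary text mixed in with the rest of the page content, this will find nothing.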

Bernd · January 27, 2026, 2:36pm

Hello Harald!

I cannot give you a real table for confidentiality reasons, but an example should suffice.

A typical table with a textual watermark looks like this:

I extract the table with page.find_tables() and then read the text of every table cell with page.get_text("text", clip=bbox). The second cell in row 3 then gives me "Toyota\nSome street\nSome town\nw" because the first letter of "watermark" is within the cell boundaries. This is the noise I would like to avoid.

When I call page.get_text("dict") I get a result similar to this:

"lines": [
    {
        "spans": [
            {
                …,
                "text": "watermark",
                …
            }
        ],
        "wmode": 0,
        "dir": [
            -0.8660258650779724,
            -0.49999910593032837
        ],
        "bbox": [
            73.7010498046875,
            285.8898620605469,
            520.7906494140625,
            556.9953002929688
        ]
    }
]

So my idea was to read the PDF page in some form, overwrite the text "watermark" with "", write the page back, and then finally proceed with reading the table with page.find_tables().
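A possible workaround that leaves the PDF untouched would be to read each cell through the "dict" output instead of "text" and skip every line whose direction is unusual. The sketch below assumes that a clipped page.get_text("dict", clip=bbox) result still carries the "dir" of the original line. The function name `cell_text`, the tolerance and the sample data are made up for illustration.

```python
def cell_text(cell_dict, tol=1e-3):
    """Join the text of all regular lines in a clipped "dict" result."""
    regular = ((1.0, 0.0), (0.0, -1.0))  # directions treated as normal text
    kept = []
    for block in cell_dict.get("blocks", []):
        for line in block.get("lines", []):  # image blocks have no "lines"
            dx, dy = line["dir"]
            is_regular = any(
                abs(dx - rx) <= tol and abs(dy - ry) <= tol
                for rx, ry in regular
            )
            if not is_regular:
                continue  # skip watermark fragments such as the stray "w"
            kept.append("".join(span["text"] for span in line["spans"]))
    return "\n".join(kept)


# Hand-made stand-in for page.get_text("dict", clip=bbox) of one cell
# (assumed sample data).
cell = {
    "blocks": [
        {"type": 0, "lines": [
            {"dir": (1.0, 0.0), "spans": [{"text": "Toyota"}]},
            {"dir": (1.0, 0.0), "spans": [{"text": "Some street"}]},
            {"dir": (1.0, 0.0), "spans": [{"text": "Some town"}]},
        ]},
        {"type": 0, "lines": [
            {"dir": (-0.866, -0.5), "spans": [{"text": "w"}]},
        ]},
    ]
}

print(cell_text(cell))
```

For a real table cell the call would be cell_text(page.get_text("dict", clip=bbox)).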
