How to preserve page structure (tables, layout) while doing OCR

Hi team,

For regular PDFs (text-based, not scanned images), I am able to extract the table content and redact whatever information I need. That works fine.

Now we have scanned PDFs, so I used OCR to extract the information:

full_tp = page.get_textpage_ocr(
    flags=0,
    dpi=300,
    full=True,
    language="eng",
    tessdata=r"C:\Users\3330\AppData\Local\Programs\Tesseract-OCR\tessdata",
)

After OCR, the layout is completely lost.

How can I preserve the layout (tables, page structure) while doing OCR, so that I can iterate over the table and redact the rows that are not needed?

Redaction is also not applied correctly after OCR:

search_terms = ["Partner Acknowledged", "0060584486", "5,555,390.00", "0.00"]

for term in search_terms:
    rectangles = full_tp.search(term)  # find all occurrences of the term
    if rectangles:
        print(f"Found '{term}' at positions: {rectangles}")
        for rect in rectangles:
            # Mark the area for redaction
            page.add_redact_annot(rect)
    else:
        print(f"'{term}' not found on this page.")

page.apply_redactions()

For regular PDFs, everything works fine.

How can I handle the following cases when using OCR?

1) Preserving structure (tables, layout)

2) Applying redaction correctly

Welcome @Ranjith !

OCR is a forgetful function! Not everything you saw on the original page is still extractable on the OCR-ed version. This applies especially to graphics (lines, shadings, …).

In addition, every word receives its own bbox according to the positions of the pixels corresponding to each character. This means that, e.g., the word “area” now lives in a bbox with a smaller height than the words “border” or “Gag”, even when they were originally written with the same font and font size. The top and bottom (y0, y1) coordinates are no longer the same as before.
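If the uneven word heights get in the way (for example, redaction rectangles that do not cover the full line), one possible workaround is to give every word the top and bottom of its line. This is only a sketch: the helper name `unify_line_heights` is made up for illustration, and it assumes word tuples in the layout `(x0, y0, x1, y1, text, block_no, line_no, word_no)` as returned by `page.get_text("words", textpage=full_tp)`.

```python
from collections import defaultdict


def unify_line_heights(words):
    """Return (x0, y0, x1, y1, text) tuples where all words of the
    same (block, line) share that line's top and bottom coordinates.

    `words` is assumed to use the PyMuPDF "words" tuple layout:
    (x0, y0, x1, y1, text, block_no, line_no, word_no).
    """
    # Group the words by the line they belong to.
    lines = defaultdict(list)
    for w in words:
        lines[(w[5], w[6])].append(w)

    result = []
    for group in lines.values():
        top = min(w[1] for w in group)      # highest point of the line
        bottom = max(w[3] for w in group)   # lowest point of the line
        for w in group:
            result.append((w[0], top, w[2], bottom, w[4]))
    return result
```

This relies on the block and line numbers that the OCR step assigned, so it is only as good as the line grouping Tesseract produced.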

You have also lost all text meta-information like color, bold, italic, font, font size, etc. The font size shown in the extracted text will differ from the original. In fact, it is back-computed from the bounding box and has thus lost its original significance.

Specifically for Tesseract, you must be aware that it generates OCR text with a mono-spaced font (“GlyphLessFont”)! Therefore, it is impossible to find the (correct) bbox of single characters inside an OCR-ed word.
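Since character positions inside an OCR-ed word are unreliable, a substring search such as “0.00” (which also occurs inside “5,555,390.00”) may well produce misplaced rectangles. A possible alternative is to match the search terms against whole OCR words only. The following is a minimal sketch, not a definitive solution: `find_term_rects` is a hypothetical helper, and it again assumes the `(x0, y0, x1, y1, text, block_no, line_no, word_no)` tuples from `page.get_text("words", textpage=full_tp)`.

```python
def find_term_rects(words, term):
    """Return one (x0, y0, x1, y1) tuple per occurrence of `term`,
    matching whole words only (never a part of a word).

    Multi-word terms must appear as consecutive words on one line.
    """
    target = term.split()
    if not target:
        return []
    n = len(target)
    hits = []
    for i in range(len(words) - n + 1):
        window = words[i:i + n]
        # The word texts must match the term exactly.
        if [w[4] for w in window] != target:
            continue
        # All matched words must belong to the same (block, line).
        if len({(w[5], w[6]) for w in window}) != 1:
            continue
        # Join the word boxes into one rectangle for the whole term.
        hits.append((
            min(w[0] for w in window),
            min(w[1] for w in window),
            max(w[2] for w in window),
            max(w[3] for w in window),
        ))
    return hits
```

Each returned tuple could then be passed on as `page.add_redact_annot(pymupdf.Rect(r))`. Note that the comparison is exact, so OCR errors or punctuation attached to a word (e.g. “Acknowledged,”) will prevent a match unless the words are normalized first.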

And I hope it is clear that all invisible content (which was extractable in the original page) has vanished completely.

Therefore, it is unrealistic to expect that table extraction can still work in the same way as before.

@HaraldLieder
Thanks for the reply.

I have tried a few approaches; among them, the one below seems to work:

First, I applied OCRmyPDF, so my PDF is searchable.

Next, using OpenCV, I detected the table (horizontal and vertical lines).

I converted the OpenCV points (horizontal and vertical lines) to PyMuPDF points and drew rectangles from those points to recreate the tables on the original PDF.

Then I used PyMuPDF to detect the table.

It works, but the result is not like the original one.

Any suggestions or improvements for the above steps?

Congratulations!
Excellent idea to use OpenCV for recovering the vector graphics! I have occasionally considered this too.

You probably do not need to run OCRmyPDF because that also uses Tesseract - like PyMuPDF does.
You could make a Pixmap of the page, OCR it and also pass that image directly to cv2 after converting it to a numpy array (as needed by cv2).
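The Pixmap-to-numpy step could look roughly like this. It is a sketch under assumptions: `page_to_bgr_array` is a made-up helper name, the Pixmap is assumed to be RGB without alpha (the default of `page.get_pixmap()`), and the channel reversal is there only because cv2 conventionally works with BGR images.

```python
import numpy as np


def page_to_bgr_array(page, dpi=300):
    """Render a PyMuPDF page and return it as an (h, w, 3) uint8
    array in BGR channel order, as cv2 functions usually expect."""
    pix = page.get_pixmap(dpi=dpi)  # RGB, no alpha by default
    # View the raw sample bytes as a height x width x channels array.
    img = np.frombuffer(pix.samples, dtype=np.uint8).reshape(
        pix.height, pix.width, pix.n
    )
    # Reverse the channel axis (RGB -> BGR) and make the result a
    # contiguous, writable copy.
    return np.ascontiguousarray(img[:, :, ::-1])
```

The resulting array can be handed to the OpenCV line detection you already have. Using the same dpi for OCR and for the rendered image keeps the coordinate conversion between both simple.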

To have PyMuPDF use the cv2-detected lines or bounding boxes directly, you can use the Page.find_tables() parameters add_lines / add_boxes. You don’t need to draw them.

  • add_lines accepts a list of tuples (point1, point2) (pymupdf.Point or tuples (x,y)).
  • add_boxes accepts a list of rectangles or rect-like tuples (x0, y0, x1, y1).

Both lists can contain overlapping objects in any sequence.
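Here is a rough sketch of how cv2 results might be fed to add_lines. The helper names are hypothetical, and the conversion assumes an unrotated page whose rectangle starts at (0, 0), rendered at a known dpi, so that one pixel equals 72/dpi points.

```python
def pixel_segments_to_points(segments, dpi):
    """Convert (x0, y0, x1, y1) pixel segments into pairs of
    (x, y) tuples in PDF points (1 point = 1/72 inch)."""
    return [
        ((x0 * 72.0 / dpi, y0 * 72.0 / dpi),
         (x1 * 72.0 / dpi, y1 * 72.0 / dpi))
        for x0, y0, x1, y1 in segments
    ]


def find_tables_with_lines(page, segments, dpi=300):
    """Run table detection, treating the detected pixel segments
    as additional lines instead of drawing them on the page."""
    lines = pixel_segments_to_points(segments, dpi)
    return page.find_tables(add_lines=lines)
```

For example, a horizontal line found at pixel row 600 between columns 300 and 1500 of a 300 dpi image becomes the segment from (72, 144) to (360, 144) in page coordinates. For rotated pages or pages with a shifted CropBox the conversion would need an extra transformation.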