Hi team,
For regular PDFs (text-based, not images), I am able to extract the table content and redact whatever information I need. That works fine.
Now we have scanned PDFs.
I used OCR to extract the information:
full_tp = page.get_textpage_ocr(
    flags=0,
    dpi=300,
    full=True,
    language="eng",
    tessdata=r"C:\Users\3330\AppData\Local\Programs\Tesseract-OCR\tessdata",
)
After OCR, the layout is completely lost.
How can I preserve the layout (tables, page structure) during OCR, so that I can iterate over the table and redact the rows that are not needed?
Redaction is also not applied correctly after OCR:
search_terms = ["Partner Acknowledged", "0060584486", "5,555,390.00", "0.00"]
for term in search_terms:
    rectangles = full_tp.search(term)  # find all occurrences of the term
    if rectangles:
        print(f"Found '{term}' at positions: {rectangles}")
        for rect in rectangles:
            # Mark the area for redaction
            page.add_redact_annot(rect)
    else:
        print(f"'{term}' not found on this page.")
page.apply_redactions()
For regular PDFs, the same code works fine:
search_terms = ["Partner Acknowledged", "0060584486", "5,555,390.00", "0.00"]
for term in search_terms:
    rectangles = full_tp.search(term)  # find all occurrences of the term
    if rectangles:
        print(f"Found '{term}' at positions: {rectangles}")
        for rect in rectangles:
            # Mark the area for redaction
            page.add_redact_annot(rect)
    else:
        print(f"'{term}' not found on this page.")
page.apply_redactions()
How can I handle the following cases when using OCR?
1) Preserving structure (tables, layout)
2) Applying redaction correctly
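
A possible starting point for case 1, offered only as a rough sketch and not verified on real scans: instead of searching for strings, take the OCR word boxes and cluster them into rows by vertical position. This assumes the words come from `page.get_text("words", textpage=full_tp)`, which yields tuples of the form `(x0, y0, x1, y1, text, block_no, line_no, word_no)`. The helper name `group_words_into_rows` and the `y_tolerance` value are hypothetical and would need tuning for the actual documents.

```python
def group_words_into_rows(words, y_tolerance=3.0):
    """Cluster word boxes into rows by vertical mid-line.

    `words` is a list of tuples shaped like PyMuPDF's "words" output:
    (x0, y0, x1, y1, text, block_no, line_no, word_no).
    Returns a list of (row_bbox, row_text) pairs, top to bottom.
    """
    rows = []  # each entry: {"y": mid-line of the row's first word, "words": [...]}
    # Visit words top to bottom, then left to right.
    for w in sorted(words, key=lambda w: ((w[1] + w[3]) / 2, w[0])):
        y_mid = (w[1] + w[3]) / 2
        if rows and abs(y_mid - rows[-1]["y"]) <= y_tolerance:
            rows[-1]["words"].append(w)  # same row as the previous word
        else:
            rows.append({"y": y_mid, "words": [w]})  # start a new row

    result = []
    for row in rows:
        ws = sorted(row["words"], key=lambda w: w[0])  # left-to-right order
        bbox = (
            min(w[0] for w in ws),
            min(w[1] for w in ws),
            max(w[2] for w in ws),
            max(w[3] for w in ws),
        )
        result.append((bbox, " ".join(w[4] for w in ws)))
    return result


# Assumed usage with the objects from the question (not run here):
# rows = group_words_into_rows(page.get_text("words", textpage=full_tp))
# for bbox, text in rows:
#     if "Partner Acknowledged" in text:   # decide which rows to drop
#         page.add_redact_annot(bbox)
# page.apply_redactions()
```

Each returned row carries one bounding box, so a whole table row can be redacted with a single annotation. Skewed scans or multi-line cells would likely need a smarter grouping rule than a fixed tolerance.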
