PDF text extraction from a complex form

Disha · July 18, 2025, 1:55pm

Is there any way to get structured JSON output from a complex form with multiple layouts (not XFA-based) using PyMuPDF?
ITR3_Notified Form AY 2023-24.pdf (2.2 MB)

Jamie_Lemon · July 18, 2025, 2:44pm

This is indeed a complex form!

To get a JSON representation of a page's text, use:

import pymupdf

doc = pymupdf.open("form.pdf")

# Select a specific page (e.g., the first page)
page = doc[0]

# Get the page text as a JSON string.
# The variable is not called "json" so it cannot shadow Python's json module.
page_json = page.get_text("json")

print(f"json: {page_json}")

# Close the document
doc.close()
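
The value returned by page.get_text("json") is a plain string, so it usually needs to be parsed before it is useful for a form like this one. Below is a minimal sketch of one way to do that, assuming the usual blocks > lines > spans layout of that output. The helper name flatten_spans and the keys chosen for each row are my own for illustration, not part of PyMuPDF.

```python
import json


def flatten_spans(page_json):
    """Turn the JSON string from page.get_text("json") into a flat list of text spans.

    Hypothetical helper: assumes the top-level object has a "blocks" list,
    text blocks have type 0 and a "lines" list, and each line has "spans".
    """
    page = json.loads(page_json)
    rows = []
    for block in page.get("blocks", []):
        # Skip image blocks (type 1); they carry no "lines".
        if block.get("type") != 0:
            continue
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                text = span.get("text", "").strip()
                if not text:
                    continue  # ignore whitespace-only spans
                rows.append(
                    {
                        "text": text,
                        "bbox": span.get("bbox"),
                        "font": span.get("font"),
                        "size": span.get("size"),
                    }
                )
    return rows


# Intended use with the snippet above (requires PyMuPDF):
#     rows = flatten_spans(page.get_text("json"))
```

Each row keeps its bounding box, so labels and values can later be grouped by position on the page. That grouping depends on the form and is not attempted here.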
