How to convert the pymupdf4llm.to_json format into a pandas DataFrame while preserving the exact row and column positioning of the contents?
Hi @mir975
- Use the pymupdf4llm.to_json() method.
- The resulting dict / JSON has the "pages" key, which is a list with one dict per page.
- Each page dict contains the key "boxes", which is a list of the layout boxes identified on the page.
- Each layout box has a "boxclass" key. Its value is "table" for tables.
- In that case there also exists the "table" key with all table-relevant data. For example, table["extract"] is a list of lists of the table's cell values. This list can be passed to pandas.
For example:
import sys
import json
import pandas
import pymupdf.layout  # imported before pymupdf4llm so that layout analysis is active
import pymupdf4llm

doc = pymupdf.open(sys.argv[1])  # PDF path given on the command line
out = pymupdf4llm.to_json(doc)  # JSON string
outdict = json.loads(out)  # parse the string into a dict

page0 = outdict["pages"][0]  # dictionary for page 0
tabboxes = [b for b in page0["boxes"] if b["boxclass"] == "table"]
tab0 = tabboxes[0]["table"]  # first table of page 0
extract = tab0["extract"]  # list of lists of cell text content

# first row becomes the column header, remaining rows become the data
df = pandas.DataFrame(extract[1:], columns=extract[0])
print(df)
Gives you this:
  Boiling Points °C   min    max     avg
0       Noble gases  -269    -62  -170.5
1         Nonmetals  -253   4827   414.1
2        Metalloids   335   3900   741.5
3            Metals   357  >5000  2755.9
for this table (shown as an image in the original post).
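If you need every table in the document rather than just the first one on page 0, the same keys can be walked in a loop. Below is a minimal sketch, not part of the library: tables_to_dataframes is a hypothetical helper name, and the sample dict is hand-written to mimic the structure described above ("pages", "boxes", "boxclass", "table", "extract"). It assumes each extract is rectangular (every row has as many cells as the header row) and that the first row is the header.

```python
import pandas as pd


def tables_to_dataframes(outdict, header=True):
    """Collect every table in a parsed to_json() dict.

    Hypothetical helper: returns a list of (page_index, table_index, DataFrame).
    Row and column order follow the order of the "extract" list of lists.
    """
    results = []
    for pno, page in enumerate(outdict.get("pages", [])):
        # keep only the layout boxes that were classified as tables
        tabboxes = [b for b in page.get("boxes", []) if b.get("boxclass") == "table"]
        for tno, box in enumerate(tabboxes):
            rows = box["table"]["extract"]
            if not rows:
                continue  # nothing to convert
            if header and len(rows) > 1:
                # assumption: the first row holds the column names
                df = pd.DataFrame(rows[1:], columns=rows[0])
            else:
                df = pd.DataFrame(rows)
            results.append((pno, tno, df))
    return results


# Hand-written stand-in for json.loads(pymupdf4llm.to_json(doc)),
# used here only so the sketch runs without a PDF.
sample = {
    "pages": [
        {
            "boxes": [
                {"boxclass": "text"},
                {
                    "boxclass": "table",
                    "table": {
                        "extract": [
                            ["Boiling Points °C", "min", "max", "avg"],
                            ["Noble gases", "-269", "-62", "-170.5"],
                            ["Metals", "357", ">5000", "2755.9"],
                        ]
                    },
                },
            ]
        }
    ]
}

tables = tables_to_dataframes(sample)
for pno, tno, df in tables:
    print(f"page {pno}, table {tno}")
    print(df)
```

With a real file you would pass outdict from the script above instead of sample. Pass header=False if a table has no header row; the DataFrame then gets default integer column labels and all rows are kept as data.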
