How to convert json to pandas dataframe

How to convert the pymupdf4llm.to_json format into a pandas DataFrame while preserving the exact row and column positioning of the contents?

Hi @mir975

  1. Use the pymupdf4llm.to_json() method.
  2. The resulting dict / json has the "pages" key which is a list of one dict per page.
  3. Each page dict contains the key "boxes" which is a list of the layout boxes identified on the page.
  4. Each layout box has a "boxclass" key. Its value is "table" for tables.
  5. In that case there is also a "table" key holding all table-relevant data. For example, table["extract"] is a list of lists of the table’s cell values. This list can be passed directly to pandas.
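
The steps above can be sketched as a small helper that walks every page and collects every table, not just the first one. This is a minimal sketch that assumes only the keys named in the list ("pages", "boxes", "boxclass", "table", "extract"). The function name `tables_from_layout` and the `sample` dict are made up for illustration; `sample` is a hand-written stand-in, not real `to_json()` output.

```python
import pandas


def tables_from_layout(layout: dict) -> list:
    """Return (page_index, DataFrame) pairs for every table box in the parsed JSON."""
    frames = []
    for page_index, page in enumerate(layout.get("pages", [])):
        for box in page.get("boxes", []):
            if box.get("boxclass") != "table":
                continue  # skip text, picture and other layout boxes
            extract = box["table"]["extract"]  # list of rows, each a list of cells
            if not extract:
                continue  # nothing to convert
            # First row is treated as the header, as in the example in this answer.
            df = pandas.DataFrame(extract[1:], columns=extract[0])
            frames.append((page_index, df))
    return frames


# Hypothetical stand-in for json.loads(pymupdf4llm.to_json(doc)).
sample = {
    "pages": [
        {"boxes": [{"boxclass": "text"}]},
        {"boxes": [{"boxclass": "table",
                    "table": {"extract": [["name", "min"], ["Noble gases", "-269"]]}}]},
    ]
}

frames = tables_from_layout(sample)
for page_index, df in frames:
    print(f"page {page_index}:")
    print(df)
```

With a real document, the parsed `to_json()` result would be passed in place of `sample`.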

For example:

import sys
import json
import pymupdf.layout  # activates PyMuPDF Layout; import before pymupdf4llm
import pymupdf4llm
import pandas


doc = pymupdf.open(sys.argv[1])
out = pymupdf4llm.to_json(doc)
outdict = json.loads(out)
page0 = outdict["pages"][0]  # dictionary for page 0
tabboxes = [b for b in page0["boxes"] if b["boxclass"] == "table"]
tab0 = tabboxes[0]["table"]  # first table of page 0
extract = tab0["extract"]  # list of lists of cell text content

df = pandas.DataFrame(extract[1:], columns=extract[0])  # create DataFrame
print(df)

Gives you this:

  Boiling Points °C   min    max     avg
0       Noble gases  -269    -62  -170.5
1         Nonmetals  -253   4827   414.1
2        Metalloids   335   3900   741.5
3            Metals   357  >5000  2755.9

for the corresponding table in the PDF.
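
On the question of preserving exact row and column positions: one option is to skip header promotion, so that the cell at table row r and column c lands at `df.iat[r, c]`. The sketch below assumes the cell values are plain text, as the printed output suggests, and reuses the values shown above as hand-typed sample data. The numeric conversion at the end is an optional extra and not part of pymupdf4llm.

```python
import pandas

# Cell values copied from the printed example; all cells are assumed to be strings.
extract = [
    ["Boiling Points °C", "min", "max", "avg"],
    ["Noble gases", "-269", "-62", "-170.5"],
    ["Nonmetals", "-253", "4827", "414.1"],
    ["Metalloids", "335", "3900", "741.5"],
    ["Metals", "357", ">5000", "2755.9"],
]

# No header promotion: rows and columns keep their 0-based table positions.
grid = pandas.DataFrame(extract)
print(grid.iat[4, 2])  # cell at table row 4, column 2

# Optional numeric view of the data cells (header row and label column excluded).
# Cells that are not plain numbers, such as ">5000", become NaN.
numeric = grid.iloc[1:, 1:].apply(pandas.to_numeric, errors="coerce")
print(numeric)
```

Keeping the header as row 0 costs a little convenience, but the DataFrame then mirrors the table grid cell for cell, which is useful when positions matter more than column names.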