Hello, I am looking for the best way to convert tables extracted from a PDF into a pandas DataFrame.
Extracting the tables as Markdown gives very good results, but I haven’t found a built-in way to convert a Markdown table to a pandas DataFrame.
This worked for me, although the example here uses a very simple Markdown table:
import pandas as pd
from io import StringIO
md_table = """
| Name | Age | City |
|-------|-----|----------|
| Alice | 30 | New York |
| Bob | 25 | London |
"""
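# Drop the separator row (only pipes, dashes, colons, spaces) and any blank lines
lines = [line for line in md_table.strip().split('\n') if not set(line.strip()) <= set('|-: ')]
# Remove the outer pipes and surrounding whitespace from each remaining row
cleaned = '\n'.join(line.strip().strip('|').strip() for line in lines)
# Split on the inner pipes; a regex separator requires the python engine
df = pd.read_csv(StringIO(cleaned), sep=r'\s*\|\s*', engine='python')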
print(df)
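If you would rather not depend on a regex separator, the same idea can be wrapped in a small helper that splits the cells by hand. This is only a sketch under some assumptions: the function name markdown_table_to_df is made up for illustration, the table has no escaped pipes (\|) inside cells, and every row has the same number of cells as the header (otherwise pandas will raise an error when building the DataFrame). It also tries to convert each column to numbers, which you may or may not want.

```python
import pandas as pd


def markdown_table_to_df(md: str) -> pd.DataFrame:
    """Parse a simple pipe-delimited Markdown table (hypothetical helper)."""
    rows = []
    for line in md.strip().splitlines():
        line = line.strip()
        # Skip blank lines and the header separator row, e.g. |---|:--:|
        if not line or set(line) <= set("|-: "):
            continue
        # Remove at most one leading and one trailing pipe so that
        # empty cells at the edges of a row are not lost
        if line.startswith("|"):
            line = line[1:]
        if line.endswith("|"):
            line = line[:-1]
        rows.append([cell.strip() for cell in line.split("|")])
    if not rows:
        return pd.DataFrame()
    header, *body = rows
    df = pd.DataFrame(body, columns=header)
    # Every cell is a string at this point; convert a column only
    # when all of its values parse as numbers
    for col in df.columns:
        try:
            df[col] = pd.to_numeric(df[col])
        except (ValueError, TypeError):
            pass
    return df


md_table = """
| Name  | Age | City     |
|-------|-----|----------|
| Alice | 30  | New York |
| Bob   | 25  | London   |
"""
df = markdown_table_to_df(md_table)
print(df)
```

With this version the Age column should come back as integers rather than strings, while Name and City stay as text. For tables coming out of a PDF extractor you will probably still need to check for merged cells or rows that wrap onto a second line before trusting the result.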