Hi,
I ran into an issue when parsing PDFs to Markdown with pymupdf-layout.
I want to parse a double-column PDF, typical of scientific papers, using pymupdf4llm and pymupdf-layout, but the extracted text comes out in the wrong reading order.
I used the following code:
import io
from pathlib import Path

import pymupdf.layout
import pymupdf4llm


def main() -> None:
    ex_pdf_path = Path("ex_pdf_double_column.pdf")  # Update path here
    buf = io.BytesIO(ex_pdf_path.read_bytes())
    with pymupdf.open(stream=buf) as document:
        pdf_text = pymupdf4llm.to_markdown(doc=document, header=False, footer=False, use_ocr=False)
    output_file = Path.cwd() / f"{ex_pdf_path.stem}_pymupdf_layout_extract.md"
    with output_file.open("w", encoding="utf-8") as out_file:
        out_file.write(pdf_text)


if __name__ == "__main__":
    main()
The input PDF is:
ex_pdf_double_column.pdf (214.4 KB)
And the extracted text is here:
ex_pdf_double_column_pymupdf_layout_extract.md (3.7 KB)
Here is a comparison of the input PDF (left) and the text extraction I obtain (right):
As you can see, I expect the text in the order red, pink, yellow, green, blue, but I get the order green, red, blue, pink, yellow.
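To make the ordering check reproducible without comparing screenshots, a small helper along these lines could be run on the extracted Markdown. This is only a sketch: `reading_order` is a hypothetical helper (not part of pymupdf or pymupdf4llm), and the marker snippets and sample text are placeholders, not text from the actual PDF.

```python
def reading_order(text: str, markers: dict[str, str]) -> list[str]:
    """Return the marker labels sorted by where each snippet first appears in text."""
    positions = {}
    for label, snippet in markers.items():
        index = text.find(snippet)
        if index == -1:
            raise ValueError(f"snippet for {label!r} not found in extracted text")
        positions[label] = index
    return sorted(positions, key=positions.get)


if __name__ == "__main__":
    # Placeholder text standing in for the extracted Markdown file.
    sample = "GREEN block ... RED block ... BLUE block ... PINK block ... YELLOW block"
    # Placeholder snippets: in practice, use a short unique phrase from each block.
    markers = {
        "red": "RED block",
        "pink": "PINK block",
        "yellow": "YELLOW block",
        "green": "GREEN block",
        "blue": "BLUE block",
    }
    print(reading_order(sample, markers))
```

With a unique phrase picked from each coloured block, the returned list can be compared directly against the expected order (red, pink, yellow, green, blue).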
This seems specific to pymupdf-layout as I didn’t observe it with pymupdf.
Library versions:
- pymupdf4llm 0.2.9
- pymupdf 1.26.6
- pymupdf-layout 1.26.6
as installed with pip: pip install pymupdf4llm[layout]
