BUG: text from double-column PDFs extracted in wrong order

Hi,
I ran into an issue while converting PDFs to Markdown with pymupdf-layout.
I want to parse a double-column PDF, typical of scientific papers, using pymupdf4llm and pymupdf-layout, but the extracted text does not come out in the correct reading order.

I used the following code:

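import io
from pathlib import Path

# Imported before pymupdf4llm so that the layout extension is active
import pymupdf.layout
import pymupdf4llm


def main() -> None:
    ex_pdf_path = Path("ex_pdf_double_column.pdf")  # Update path here

    # Open the PDF from an in-memory buffer rather than from the file path
    buf = io.BytesIO(ex_pdf_path.read_bytes())
    with pymupdf.open(stream=buf) as document:
        # Convert to Markdown without page headers/footers and without OCR
        pdf_text = pymupdf4llm.to_markdown(doc=document, header=False, footer=False, use_ocr=False)

    # Write the Markdown next to the current working directory for inspection
    output_file = Path.cwd() / f"{ex_pdf_path.stem}_pymupdf_layout_extract.md"
    with output_file.open("w", encoding="utf-8") as out_file:
        out_file.write(pdf_text)


if __name__ == "__main__":
    main()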

with the input PDF:

ex_pdf_double_column.pdf (214.4 KB)
The extracted text is here:
ex_pdf_double_column_pymupdf_layout_extract.md (3.7 KB)

Here is a comparison of the input pdf (left) and the text extraction I obtain (right):


As you can see, I expect the text in the order red, pink, yellow, green, blue, but I get green, red, blue, pink, yellow.
This seems specific to pymupdf-layout, as I didn’t observe it with pymupdf.
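
To make the ordering problem checkable without comparing screenshots, here is a minimal sketch that reports the order in which known snippets appear in the extracted Markdown. The helper name `appearance_order` and the placeholder strings are assumptions made for illustration; they are not taken from the attached PDF or from any pymupdf API.

```python
def appearance_order(text: str, snippets: dict[str, str]) -> list[str]:
    """Return the snippet labels sorted by where each snippet first occurs in text."""
    positions = {}
    for label, snippet in snippets.items():
        index = text.find(snippet)
        if index == -1:
            raise ValueError(f"snippet for {label!r} not found in text")
        positions[label] = index
    return sorted(positions, key=positions.get)


# Placeholder text standing in for the extracted Markdown (hypothetical content).
extracted = "DDDD AAAA EEEE BBBB CCCC"
# One short, unique snippet per coloured region, listed in the expected reading order.
snippets = {"red": "AAAA", "pink": "BBBB", "yellow": "CCCC", "green": "DDDD", "blue": "EEEE"}

expected = list(snippets)
actual = appearance_order(extracted, snippets)
print("expected:", expected)
print("actual:  ", actual)
print("in reading order:", actual == expected)
```

In practice, `extracted` would be the string returned by `pymupdf4llm.to_markdown`, and each snippet would be a short phrase copied from the corresponding region of the PDF.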

Library versions:

  • pymupdf4llm 0.2.9
  • pymupdf 1.26.6
  • pymupdf-layout 1.26.6

as installed with pip: pip install pymupdf4llm[layout]

Hi @JasmineGMT, thank you for your comprehensive bug report. I can replicate the issue on my end. I think we will need to investigate how the Layout model interprets this PDF and see how to train it better.

I have updated our internal issue board with this bug report and included a link to this post in the report. Hopefully this problem will be resolved in the next release of pymupdf-layout.

Thanks again for taking the time to report this!


Also a thank you from me.
I have spotted and fixed the problem in PyMuPDF4LLM. The fix will be available in version 0.3.0.
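
For anyone who wants to check whether their environment already has that release, here is a minimal sketch using the standard library's `importlib.metadata`. The helper names `version_tuple` and `has_fix` are hypothetical, and the version parsing is deliberately simplified: it only reads the leading digits of each dot-separated part.

```python
from importlib.metadata import PackageNotFoundError, version
from itertools import takewhile
from typing import Optional


def version_tuple(text: str) -> tuple[int, ...]:
    """Turn a version string such as '0.2.9' into (0, 2, 9), reading leading digits only."""
    numbers = []
    for piece in text.split("."):
        digits = "".join(takewhile(str.isdigit, piece))
        if not digits:
            break  # stop at the first part that does not start with a digit
        numbers.append(int(digits))
    return tuple(numbers)


def has_fix(package: str = "pymupdf4llm", minimum: str = "0.3.0") -> Optional[bool]:
    """Return True/False if the package is installed, or None if it is not installed."""
    try:
        installed = version(package)
    except PackageNotFoundError:
        return None
    return version_tuple(installed) >= version_tuple(minimum)


print("pymupdf4llm includes the fix:", has_fix())
```

With the versions listed in the report (pymupdf4llm 0.2.9), this would report `False` until the package is upgraded.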
