BUG: double column pdfs text extracted in wrong order

JasmineGMT · January 13, 2026, 4:12pm

Hi,
I ran into an issue parsing PDFs to Markdown with pymupdf-layout.
I want to parse a double-column PDF, typical of scientific papers, using pymupdf4llm and pymupdf-layout, but the extracted text comes out in the wrong order.

I used the following code:

import io
from pathlib import Path

import pymupdf.layout  # imported before pymupdf4llm to activate the layout feature
import pymupdf4llm


def main() -> None:
    ex_pdf_path = Path("ex_pdf_double_column.pdf")  # Update path here

    buf = io.BytesIO(ex_pdf_path.read_bytes())
    with pymupdf.open(stream=buf) as document:
        pdf_text = pymupdf4llm.to_markdown(doc=document, header=False, footer=False, use_ocr=False)

    output_file = Path.cwd() / f"{ex_pdf_path.stem}_pymupdf_layout_extract.md"
    with output_file.open("w", encoding="utf-8") as out_file:
        out_file.write(pdf_text)


if __name__ == "__main__":
    main()

The input PDF:

ex_pdf_double_column.pdf (214.4 KB)
The extracted text:
ex_pdf_double_column_pymupdf_layout_extract.md (3.7 KB)

Here is a comparison of the input pdf (left) and the text extraction I obtain (right):

As you can see, I expect the text in the order red, pink, yellow, green, blue, but I get green, red, blue, pink, yellow.
This seems specific to pymupdf-layout, as I didn't observe it with pymupdf.
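
For anyone who wants to check the ordering automatically rather than by eye, here is a minimal sketch. It assumes you pick one short, unique phrase from each highlighted block and look up where each phrase first appears in the extracted Markdown. `observed_order` is a hypothetical helper, not part of pymupdf4llm, and the colour names below are placeholders standing in for real phrases from the PDF.

```python
def observed_order(text: str, snippets: list[str]) -> list[str]:
    """Return the snippets sorted by where each first appears in text.

    Hypothetical helper for illustration; raises ValueError if a snippet
    does not occur in the text at all.
    """
    positions = {snippet: text.find(snippet) for snippet in snippets}
    missing = [snippet for snippet, pos in positions.items() if pos == -1]
    if missing:
        raise ValueError(f"snippets not found: {missing}")
    return sorted(snippets, key=positions.__getitem__)


# Placeholder data: each colour name stands in for a phrase from that block.
expected = ["red", "pink", "yellow", "green", "blue"]
extracted = "green red blue pink yellow"

order = observed_order(extracted, expected)
print("observed:", order)
print("matches expected order:", order == expected)
```

With the real files, `extracted` would be the contents of the .md output and `expected` the phrases in the order they should be read.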

Library versions:

pymupdf4llm 0.2.9
pymupdf 1.26.6
pymupdf-layout 1.26.6

installed with pip: pip install pymupdf4llm[layout]

Jamie_Lemon · January 14, 2026, 5:27pm

Hi @JasmineGMT, thank you for your comprehensive bug report. I can replicate the issue on my end. I think we will need to investigate how the Layout model is interpreting this PDF and see how to train it better.

I have updated our internal issue board with this bug report and included a link to this post in the report. Hopefully the problem will be resolved in the next release of pymupdf-layout.

Thanks again for taking the time to report this!

HaraldLieder · January 16, 2026, 2:55pm

Also a thank you from me.
I have spotted and fixed the problem in PyMuPDF4LLM. The fix will be available in version 0.3.0.
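
To tell whether an environment already has the release mentioned above, a small check along these lines may help. It is only a sketch: `parse_release`, `has_fix` and `installed_has_fix` are hypothetical helper names, and the parsing assumes a plain dotted version such as 0.2.9 (a pre-release suffix like "rc1" would raise ValueError).

```python
from importlib.metadata import PackageNotFoundError, version

FIXED_IN = (0, 3, 0)  # pymupdf4llm release named in the post above


def parse_release(version_string: str) -> tuple[int, ...]:
    """Turn '0.2.9' into (0, 2, 9), padding short versions with zeros."""
    parts = [int(part) for part in version_string.split(".")[:3]]
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def has_fix(version_string: str) -> bool:
    """True if the given version is at or above the fixed release."""
    return parse_release(version_string) >= FIXED_IN


def installed_has_fix(package: str = "pymupdf4llm"):
    """Check the installed package; returns None if it is not installed."""
    try:
        return has_fix(version(package))
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("0.2.9 has fix:", has_fix("0.2.9"))
    print("installed pymupdf4llm has fix:", installed_has_fix())
```

Once 0.3.0 is published, upgrading with pip install -U pymupdf4llm[layout] should pick it up.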
