To_markdown only producing header tags (and no text), to_json produces correct text from spans

Official Copy (Register) - NK92733-extract.pdf (441.5 KB)

I am trying to extract markdown text from this document but it only produces a set of markdown headers with no text. When I try to_json the spans in the document are correctly recognised. I would expect to_markdown to recognise these spans and use them.

The issue looks similar to this one: https://forum.mupdf.com/t/problem-with-pymupdf4llm-to-markdown/326

Code to reproduce:

import sys
import pymupdf.layout
import pymupdf4llm
import pymupdf


def main(filepath):
    markdown_text = ""
    json_text = ""
    try:
        with pymupdf.open(filepath) as pdf:  
            markdown_text = pymupdf4llm.to_markdown(
                pdf,
                use_ocr=2,
            )
            json_text = pymupdf4llm.to_json(
                pdf,
                use_ocr=2,  
            )

    except Exception as e:
        print(f"Error processing PDF with PyMuPDF: {e}")

    print(markdown_text)
    print(json_text)

    try:
        with open("D:\\database\\raw_pymupdf.md", 'w', encoding='utf-8') as f:
            f.write(markdown_text)
        with open("D:\\database\\raw_pymupdf.json", 'w', encoding='utf-8') as f:
            f.write(json_text)
    except Exception as e:
        print(f"Error saving content data for raw pdf: {e}")

if __name__ == "__main__":
    # get command line argument for file content
    print(f"{pymupdf.version=}, {pymupdf4llm.version=}")
    if len(sys.argv) < 2:
        print("Usage: python ingestion_helpers.py <filepath>")
        sys.exit(1)
    filepath = sys.argv[1]
    main(filepath)

pymupdf.version=('1.27.2.2', '1.27.2', None), pymupdf4llm.version='1.27.2.2'

Any ideas?

BBUK

I should say that straight pymupdf extracts the span text fine (albeit unformatted), the issue only seems to be with pymupdf4llm.to_markdown().
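
For reference, this is roughly how the span text can be pulled out with plain pymupdf. It is only a minimal sketch: `span_texts` and `extract_spans` are hypothetical helper names, and it assumes the usual blocks / lines / spans layout returned by `page.get_text("dict")`.

```python
def span_texts(page_dict):
    """Collect (bbox, text) pairs from a page.get_text("dict") style dictionary."""
    spans = []
    for block in page_dict.get("blocks", []):
        # Image blocks carry no "lines" key, so they are skipped here.
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                text = span.get("text", "")
                if text.strip():  # ignore whitespace-only spans
                    spans.append((tuple(span.get("bbox", ())), text))
    return spans


def extract_spans(filepath):
    """Open a PDF and return the span texts for every page (no OCR involved)."""
    import pymupdf  # imported lazily so span_texts() can be used without it

    with pymupdf.open(filepath) as pdf:
        return [span_texts(page.get_text("dict")) for page in pdf]
```

Calling `extract_spans(filepath)` on the sample gives one list of `(bbox, text)` tuples per page, which is the information I would expect to_markdown to build on.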

Hmmm, interesting. I do get results using 1.27.2.2, albeit the Markdown seems a bit wrong:

=== Document parser messages ===
Using RapidOCR and Tesseract for OCR processing.
OCR on page.number=0/1.
OCR on page.number=1/2.
OCR on page.number=2/3.

The electronic official `The electronic official copy of the register follows this message.` copy of the registerPlease note that this is the only official copy we | `Please note that this is the only official copy we will issue. We will not issue a` > will issue.| We will not issueal paper officialfollows `paper official copy.` this message. copy. 

## HM Land Registry HM Land Registry 

## **Official copy** Official copy **of register of** of register of title **title** 

(There is more than that, but some of this output looks strange to me.)

If I do use_ocr=True then I get much more reliable results:

The electronic official copy of the registerPlease note that this is the only official copy we |> will issue.| We will not issueal paper officialfollows this message. copy. 

HM Land Registry 

## Official copy of register of title 

## Title number NK92733 

## Edition date 04.12.2000 

- This official copy shows the entries on the register of title on 31 AUG 2021 at 10:22:51. 

- This date must be quoted as the "search from date" in any official search application based on this copy. 

- The date at the beginning of an entry is the date on which the entry was made in the register ~~.~~ 

- Issued on 31 Aug 2021, 


Can you confirm you have the required OCR engines installed correctly?
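
A quick way to check is something like the sketch below. The names here are assumptions rather than anything taken from pymupdf4llm itself: it guesses that RapidOCR is importable as `rapidocr_onnxruntime` and that Tesseract is reachable as a `tesseract` executable on the PATH, and `ocr_engines_available` is a hypothetical helper. What pymupdf4llm actually probes for internally may differ.

```python
import importlib.util
import shutil


def ocr_engines_available(modules=("rapidocr_onnxruntime",), binaries=("tesseract",)):
    """Report which of the named Python modules and executables can be found.

    The default names are assumptions about a typical RapidOCR / Tesseract setup.
    """
    found = {}
    for name in modules:
        # find_spec returns None when a top-level module is not installed
        found[name] = importlib.util.find_spec(name) is not None
    for name in binaries:
        # shutil.which returns None when the executable is not on the PATH
        found[name] = shutil.which(name) is not None
    return found


if __name__ == "__main__":
    for name, ok in ocr_engines_available().items():
        print(f"{name}: {'found' if ok else 'missing'}")
```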

Great, thanks.

I don’t have the OCR engines installed and I don’t really want to use OCR in the sample I provided - the correct span text and bbox info is already there! The longer story is that I am using embed_images=True and where there are any images in the pdfs I am processing, I use a specialised LLM-based image classifier. What I would expect is that span text would be extracted to markdown and any real images (here land plans, but not in the provided sample) would then be separately processed.

Do you know if this is achievable?

Thanks again

BBUK

What happens if you use the previous version of pymupdf4llm? pip install pymupdf4llm==1.27.2.1 and then just:

markdown_text = pymupdf4llm.to_markdown(pdf)

For me, that produces the Markdown without any need for OCR.
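
After pinning, it may be worth confirming which versions the interpreter really picks up. A minimal sketch using only the standard library; `installed_version` is a hypothetical helper name, and the distribution names are assumed to match the pip package names.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed distribution version, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Distribution names assumed to be the same as on PyPI.
    for name in ("pymupdf", "pymupdf4llm"):
        print(name, installed_version(name))
```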

Yup, that pretty much works 100%. Thanks - possibly a regression. The only (possibly unrelated) thing I see is that embed_images does not work as expected for one of my test files:

Kibblesworth Title Plan TY301184.PDF (297.8 KB)

For this file, I would expect the markdown to be extracted with an embedded base64 encoded image representing the last page. The markdown text is fine but I see no embedded image. embed_images does, however, work as expected on other files with pages with mixed text and images.
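
In case it helps with reproducing, this is roughly how I check the output for embedded images. It is a sketch under an assumption: that an embedded image appears in the markdown as `![alt](data:image/<format>;base64,<payload>)`. `EMBEDDED_IMAGE` and `embedded_images` are hypothetical names of my own.

```python
import re

# Assumed shape of an embedded image: ![alt](data:image/<fmt>;base64,<payload>)
EMBEDDED_IMAGE = re.compile(
    r"!\[[^\]]*\]\(data:image/([A-Za-z0-9.+-]+);base64,([A-Za-z0-9+/=\s]+)\)"
)


def embedded_images(markdown_text):
    """Return (format, payload_length) for each base64 image found in the Markdown."""
    return [
        (match.group(1), len(match.group(2)))
        for match in EMBEDDED_IMAGE.finditer(markdown_text)
    ]
```

An empty list for a document that clearly contains an image page is the symptom I am describing.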

If helpful, I would add that with that file I do get the embedded base64 image as expected if I set pymupdf4llm.use_layout(False)!?!

The thing is, that setting breaks the markdown generation on other files, such as the one I first posted.

@BBUK Yes, I see what you mean. I will log a bug for this with the internal team. It seems strange that it doesn’t work as expected.

Thank you.

Will keep you updated when I find out more!