I am trying to extract markdown text from this document, but it only produces a set of markdown headers with no text. When I try to_json, the spans in the document are correctly recognised. I would expect to_markdown to recognise these spans and use them.
Hmmm, interesting. I do get results using 1.27.2.2, although the Markdown seems a bit wrong:
=== Document parser messages ===
Using RapidOCR and Tesseract for OCR processing.
OCR on page.number=0/1.
OCR on page.number=1/2.
OCR on page.number=2/3.
The electronic official `The electronic official copy of the register follows this message.` copy of the registerPlease note that this is the only official copy we | `Please note that this is the only official copy we will issue. We will not issue a` > will issue.| We will not issueal paper officialfollows `paper official copy.` this message. copy.
## HM Land Registry HM Land Registry
## **Official copy** Official copy **of register of** of register of title **title**
(There is more than that, but this output looks rather strange to me.)
If I do use_ocr=True then I get much more reliable results:
The electronic official copy of the registerPlease note that this is the only official copy we |> will issue.| We will not issueal paper officialfollows this message. copy.
HM Land Registry
## Official copy of register of title
## Title number NK92733
## Edition date 04.12.2000
- This official copy shows the entries on the register of title on 31 AUG 2021 at 10:22:51.
- This date must be quoted as the "search from date" in any official search application based on this copy.
- The date at the beginning of an entry is the date on which the entry was made in the register ~~.~~
- Issued on 31 Aug 2021,
Can you confirm you have the required OCR engines installed correctly?
I don’t have the OCR engines installed and I don’t really want to use OCR in the sample I provided - the correct span text and bbox info is already there! The longer story is that I am using embed_images=True and where there are any images in the pdfs I am processing, I use a specialised LLM-based image classifier. What I would expect is that span text would be extracted to markdown and any real images (here land plans, but not in the provided sample) would then be separately processed.
Yup, that pretty much works 100%. Thanks - possibly a regression. The only (possibly unrelated) thing I see is that embed_images does not work as expected for one of my test files:
For this file, I would expect the markdown to be extracted with an embedded base64-encoded image representing the last page. The markdown text is fine, but I see no embedded image. embed_images does, however, work as expected on other files whose pages mix text and images.
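A quick way to check whether a given markdown string actually contains an embedded image, and to pull any images out for separate processing, is sketched below. It uses only the Python standard library and works on the markdown text itself, so it does not depend on any particular extractor version. The assumption here is that embedded images appear as data URIs of the form `![...](data:image/<format>;base64,<payload>)`; the helper name `split_markdown` is made up for illustration and is not part of any library.

```python
import base64
import re

# Assumed shape of an embedded image reference in the markdown output:
#   ![alt](data:image/<format>;base64,<payload>)
DATA_URI_IMAGE = re.compile(
    r"!\[[^\]]*\]\(data:image/(?P<fmt>[A-Za-z0-9.+-]+);base64,"
    r"(?P<payload>[A-Za-z0-9+/=\s]+)\)"
)


def split_markdown(md_text):
    """Hypothetical helper: return (text_without_images, [(format, bytes), ...])."""
    images = []
    for match in DATA_URI_IMAGE.finditer(md_text):
        # Drop any whitespace or line breaks inside the payload before decoding.
        payload = "".join(match.group("payload").split())
        images.append((match.group("fmt"), base64.b64decode(payload)))
    # Remove the image references so only the span text remains.
    text_only = DATA_URI_IMAGE.sub("", md_text)
    return text_only, images


# Tiny made-up sample: one header plus one embedded "image" (payload is not a real PNG).
sample = "## Title number NK92733\n\n![](data:image/png;base64,aGVsbG8=)\n"
text, images = split_markdown(sample)
print(f"{len(images)} embedded image(s) found")
```

If the returned image list is empty for the markdown of the problem file, that confirms no image was embedded; the decoded bytes from other files could then be handed to the image classifier while the remaining text is used as-is.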