Why is this graphic NOT extracted as an image by pymupdf4llm.to_markdown(write_images=True)?

I am using pymupdf4llm to extract patient leaflet information from PDF documents published by the European Medicines Agency. The PDF file contains a small transparent graphic showing a blue triangle with an exclamation mark inside. For some reason this graphic is not exported and is not referenced in the resulting Markdown file. Other graphics work fine. Obviously this is important safety information.
Does anybody have an idea why?
The file is: https://www.ema.europa.eu/en/documents/product-information/fiasp-epar-product-information_en.pdf (the first missing blue triangle is on page 61).
The code I use is below.

import pathlib

import pymupdf4llm

# Convert the PDF to Markdown and write embedded images to disk
md_text = pymupdf4llm.to_markdown("/pleaflet/samples/fiasp-epar-product-information_en.pdf", write_images=True, force_text=False)

# Now work with the Markdown text, e.g. store it as a UTF-8 encoded file
pathlib.Path("noutput.md").write_bytes(md_text.encode())

I think this is to do with the image_size_limit parameter as defined here:

  • image_size_limit (float) – this must be a positive value less than 1. Images are ignored if width / page.rect.width <= image_size_limit or height / page.rect.height <= image_size_limit. For instance, the default value 0.05 means that to be considered for inclusion, an image’s width and height must be larger than 5% of the page’s width and height, respectively.

So I think if you set image_size_limit to 0, the small triangle should show up in the output.
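
To make the rule concrete, here is a rough sketch of the size check as the documentation describes it, followed by the adjusted call. The helper names (image_is_ignored, extract_with_small_images) and the 20 x 18 point icon size are my own assumptions for illustration; I have not measured the actual triangle in the PDF, and the real check inside pymupdf4llm may be implemented differently.

```python
def image_is_ignored(img_w, img_h, page_w, page_h, image_size_limit=0.05):
    """Hypothetical helper restating the documented rule: an image is skipped
    if its width OR its height is at most image_size_limit of the page's."""
    return (img_w / page_w <= image_size_limit) or (img_h / page_h <= image_size_limit)


# An A4 page is roughly 595 x 842 points. The 20 x 18 point icon size is an
# assumed value, not measured from the EMA document.
print(image_is_ignored(20, 18, 595, 842))     # True: skipped with the default limit
print(image_is_ignored(20, 18, 595, 842, 0))  # False: kept once the limit is 0


def extract_with_small_images(pdf_path):
    """Hypothetical wrapper: same call as in the question, with the size filter off."""
    import pymupdf4llm  # imported here so the check above runs without the package

    return pymupdf4llm.to_markdown(pdf_path, write_images=True, image_size_limit=0)
```

With the limit at 0, every small decorative image on a page will be written out as well, so you may want to filter the output afterwards.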


Thanks a lot for your quick answer! Setting the parameter to 0 solved my problem. Sorry, this was an RTFM problem! I will now check your documentation link. Thanks again for your help!

No worries - happy coding! :slight_smile:
