Viswa
Hi,
I am trying to create Markdown from a PDF, and the problem is with images that are embedded within a table.
PDF I am trying to extract: https://cars.tatamotors.com/content/dam/tml/pv/general/service/owners-manual/pdf/harrier/harrier-bs6-owners-manual-april-2026.pdf
Refer to pages 64 and 65.
The images in the Pictogram column are not being extracted.
I went through the forum and tried different options, setting image_size_limit=0 and ignore_graphics=False, but none of them works.
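import pymupdf4llm

FILE = "harrier-bs6-owners-manual-april-2026.pdf"
# pages expects a list of 0-based page numbers, so PDF pages 64 and 65 are [63, 64]
md_text = pymupdf4llm.to_markdown(FILE, pages=[63, 64], header=False, footer=False, embed_images=True, image_size_limit=0, ignore_graphics=False)
# the context manager closes the file even if the write fails
with open("out-markdown.md", "w", encoding="utf-8") as output:
    output.write(md_text)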
Welcome to the Forum @Viswa !
Images, hyperlinks and vector graphics inside table cells are currently out of scope - sorry.
Viswa
Thanks for the quick reply. Is there any plan to include this feature in a later release that I could look out for?
Also, is there a way to detect that a table contains an image that was not extracted?
We do intend to support this, but there is no schedule yet: our list of planned enhancements is loooong.
But you can easily determine whether there are images inside any region of the page, e.g. inside the bbox of a table, a cell, or anything else:
import pymupdf
images = page.get_image_info()  # list of images on page (metadata only)
# keep only images whose bbox lies entirely inside the region "bbox"
images_in_bbox = [img for img in images if img["bbox"] in pymupdf.Rect(bbox)]
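The check above only counts images that lie completely inside the region. If a pictogram sticks out of its cell by a fraction of a point, it would be missed. Below is a minimal sketch of both variants (full containment and partial overlap) written in plain Python so it can be tried without a PDF. It assumes each image entry is a dict with a "bbox" key holding (x0, y0, x1, y1), which is what page.get_image_info() returns. The helper names contains, overlaps and images_in_region are made up for this illustration and are not part of PyMuPDF.

```python
def contains(region, bbox):
    """True if bbox lies entirely inside region; both are (x0, y0, x1, y1)."""
    rx0, ry0, rx1, ry1 = region
    x0, y0, x1, y1 = bbox
    return rx0 <= x0 <= x1 <= rx1 and ry0 <= y0 <= y1 <= ry1


def overlaps(region, bbox):
    """True if bbox and region share an area of non-zero size."""
    rx0, ry0, rx1, ry1 = region
    x0, y0, x1, y1 = bbox
    return min(rx1, x1) > max(rx0, x0) and min(ry1, y1) > max(ry0, y0)


def images_in_region(images, region, partial=False):
    """Filter image-info dicts by their "bbox" entry (hypothetical helper)."""
    test = overlaps if partial else contains
    return [img for img in images if test(region, img["bbox"])]


# Possible usage with PyMuPDF (a sketch, not run here):
#   import pymupdf
#   page = pymupdf.open("harrier-bs6-owners-manual-april-2026.pdf")[63]
#   images = page.get_image_info()
#   for tab in page.find_tables().tables:
#       hits = images_in_region(images, tab.bbox, partial=True)
#       if hits:
#           print(f"table at {tab.bbox} contains {len(hits)} image(s)")
```

Passing partial=True is the more forgiving choice for flagging tables whose images will be missing from the Markdown output. The same filter could be applied per cell by looping over each table's cell rectangles instead of the table bbox.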