Issue: Hyperlink extraction from PDF to Markdown is not working

I have the following code to convert PDF files into Markdown format, but the hyperlinks present in the PDF are not being preserved or converted correctly into Markdown.

import pymupdf
import pymupdf4llm

# file_path is the path to the input PDF
doc = pymupdf.open(file_path)
md_text = pymupdf4llm.to_markdown(
    doc,
    header=False,
    footer=False,
    embed_images=True,
    dpi=300,
    use_ocr=True
)

This feature is not yet supported in PyMuPDF4LLM. Links are a little tricky to deal with because they can be internal (pointing to another part of the document) or external (pointing to a website). Many of the links you get from a Page (see the PyMuPDF documentation) are also invisible rectangles overlaid on areas of the page, rather than part of the text itself. For text that is obviously a website link, you could post-process the resulting Markdown and look for inline URLs: for example, if the text body contains https://, wrap that URL in the correct Markdown link syntax.
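
As a rough illustration of that post-processing idea, here is a minimal sketch using only the standard library. It is not part of PyMuPDF4LLM: the function name linkify_bare_urls and the regular expression are assumptions made for this example. It wraps bare http:// and https:// URLs in Markdown autolink syntax (<url>) and leaves URLs that already sit inside [text](url) or <url> untouched.

```python
import re

# Bare http(s) URL that is not already inside Markdown link syntax.
# The lookbehind skips URLs directly preceded by "(", "<" or "[".
# (Hypothetical pattern for illustration; tune it for your documents.)
BARE_URL = re.compile(r"(?<![(<\[])\bhttps?://[^\s<>()\[\]]+")

# Punctuation that usually ends a sentence rather than a URL.
TRAILING_PUNCT = ".,;:!?"


def linkify_bare_urls(md_text: str) -> str:
    """Wrap bare URLs in Markdown autolink syntax: <https://...>."""

    def wrap(match: re.Match) -> str:
        url = match.group(0)
        stripped = url.rstrip(TRAILING_PUNCT)
        tail = url[len(stripped):]  # punctuation moved back outside the link
        return f"<{stripped}>{tail}"

    return BARE_URL.sub(wrap, md_text)
```

You would call it on the converter output, e.g. md_text = linkify_bare_urls(md_text). Known limitations of this sketch: it only finds URLs that appear as visible text, so links that exist only as invisible rectangles are not recovered; it does not skip code blocks; and URLs containing parentheses or brackets are cut off at the first such character.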
