While I have no problem extracting the text from a rectangular clip area with pymupdf, I cannot do the same with pymupdf4llm, as suggested in The PyMuPDF4LLM API - PyMuPDF documentation.
This works:
import fitz
rect = fitz.Rect(33, 52, 366, 585)
doc = fitz.open(pdf_file)
…
and then, for every page:
tp = page.get_textpage(clip=rect)
I cannot do a similar thing with pymupdf4llm, because no clipping is done:
import pymupdf
import pymupdf4llm
pymupdf4llm.use_layout=False
testo_estratto = pymupdf4llm.to_markdown("Treatise_Book_1.pdf", margins=(33, 52, 366, 585))
The margins parameter is supported in non-layout mode only. It specifies the width/height of the borders on a page that should be disregarded, not the coordinates of a clip rectangle. So, looking at your example, the values 366 and 585 are definitely incorrect: they would be read as the widths of the right and bottom borders.
Using margins=(72, 72, 72, 72) would ignore 1-inch borders at all four edges.
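If you already have a clip rectangle, it can be translated into margins values with a little arithmetic. This is only a sketch: the helper name clip_to_margins is made up for illustration, it assumes the margins are given in the order (left, top, right, bottom), and the page size used below (612 x 792 points, US Letter) is an assumption, since the real size of your pages is not known here.

```python
def clip_to_margins(clip, page_width, page_height):
    """Convert a clip rectangle (x0, y0, x1, y1) into border widths.

    Returns (left, top, right, bottom), i.e. how much of the page to
    disregard at each edge. Hypothetical helper, not part of pymupdf4llm.
    """
    x0, y0, x1, y1 = clip
    # The clip must be a non-empty rectangle lying inside the page.
    if not (0 <= x0 < x1 <= page_width and 0 <= y0 < y1 <= page_height):
        raise ValueError("clip rectangle must lie inside the page")
    return (x0, y0, page_width - x1, page_height - y1)


# Assumed page size: 612 x 792 points (US Letter). Replace with the
# actual width and height of your pages.
margins = clip_to_margins((33, 52, 366, 585), 612, 792)
print(margins)  # (33, 52, 246, 207)
```

The resulting tuple is what you would pass as margins=... in non-layout mode. If the pages differ in size, the values would have to be computed per page.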
For layout mode, the only solution I can think of is fiddling with the CropBox of the pages: before each page is processed, do page.set_cropbox(page.rect + (72, 72, -72, -72)).
Roughly this approach: