How to access PDF containers BDC/EMC and BMC/EMC

Are there dedicated functions to access PDF marked-content containers delimited by BDC/EMC and BMC/EMC? I am trying to mark up text lines and access them later.

I create the PDF from LaTeX (via PostScript pdfmark):

\special {ps: [ /Line1 << /Pos /Left /Page (1) >> /BDC pdfmark}
1
\special{ps: [ /EMC pdfmark}

In the PDF I can see that the lines are marked (e.g. /Line1/R13 BDC):

/Line1/R13 BDC
q
10 0 0 10 0 0 cm BT
/R14 9.96264 Tf
1 0 0 1 148.716 645.282 Tm
[(1)-0.469283]TJ
ET
Q
EMC
....
....
%% Original object ID: 17 0
20 0 obj
<<
  /Page (1)
  /Pos /Left
>>
endobj

In the BDC dictionary I store the page and line number. Is there an easy way to get the position on the page of marked content when I know the page and the marked-content name (for example, page 1, /Line1)? Right now there is doc.resolve_names(), which sounds like it should return object names but is dedicated only to destinations (why not call it doc.get_destinations() then?).

import fitz

doc = fitz.open("qbody.pdf")

# print the raw content stream(s) of every page
for page in doc:
    for cont in page.get_contents():
        print(doc.xref_stream(cont))

# resolve_names() only covers named destinations
names = doc.resolve_names()
print(names)

Any hint would be appreciated.

qbody.pdf (5.3 KB)

Hi @Linas,

No, there is no such dedicated access.
You must access the /Contents of the page and hack your way through it.
And you would have to decipher the /Properties object of the page to look up objects like R13 = << /Page (1) /Pos /Left >> etc.
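To illustrate what "hacking through" the /Contents could look like, here is a minimal sketch in plain Python. It scans content stream bytes for BDC/BMC ... EMC blocks and records the first Tm operands seen inside each block. The names TOKEN, marked_content_positions and SAMPLE are made up for this example, and the sample bytes are simply the stream from the question. It assumes property lists are given by name (as in /Line1/R13 BDC), does not skip string literals or comments, and reports the raw Tm operands without applying any cm matrices, so treat the result as a starting point rather than final page coordinates.

```python
import re

NUM = rb"[-+]?(?:\d+\.?\d*|\.\d+)"
NAME = rb"[^\s/<>\[\]()]+"

# One pattern for the three things we care about:
#   "/Tag /Prop BDC" or "/Tag BMC", "EMC", and "a b c d e f Tm"
TOKEN = re.compile(
    rb"/(?P<tag>" + NAME + rb")\s*(?:/(?P<prop>" + NAME + rb")\s+BDC\b|BMC\b)"
    rb"|(?P<emc>\bEMC\b)"
    rb"|(?<![\w.])(?P<tm>(?:" + NUM + rb"\s+){6})Tm\b"
)


def marked_content_positions(stream: bytes):
    """Return (tag, property_name, (x, y)) for every closed marked-content block.

    (x, y) are the last two operands of the first Tm inside the block,
    or None if the block contains no Tm. property_name is None for BMC.
    """
    results = []
    stack = []  # currently open blocks, innermost last
    for m in TOKEN.finditer(stream):
        if m.group("tag"):
            prop = m.group("prop")
            stack.append({
                "tag": m.group("tag").decode(),
                "prop": prop.decode() if prop else None,
                "pos": None,
            })
        elif m.group("emc"):
            if stack:
                block = stack.pop()
                results.append((block["tag"], block["prop"], block["pos"]))
        elif m.group("tm"):
            nums = [float(n) for n in m.group("tm").split()]
            if stack and stack[-1]["pos"] is None:
                stack[-1]["pos"] = (nums[4], nums[5])
    return results


# Sample stream copied from the question
SAMPLE = b"""/Line1/R13 BDC
q
10 0 0 10 0 0 cm BT
/R14 9.96264 Tf
1 0 0 1 148.716 645.282 Tm
[(1)-0.469283]TJ
ET
Q
EMC
"""

print(marked_content_positions(SAMPLE))
```

On a real page the bytes would come from doc.xref_stream(cont), as in the loop from the question. The property name in each result (R13 here) is the key to look up in the page's /Properties.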

You could use PyMuPDF’s submodule mupdf for easy access of the page object’s dictionaries. Also quite hacky … but possible if you know what you are doing:

import pymupdf

doc = pymupdf.open("qbody.pdf")  # file from the question
page = doc[0]  # first page

mupdf = pymupdf.mupdf  # sub-module with the low-level MuPDF bindings
pdfpage = pymupdf._as_pdf_page(page)  # underlying PDF page

# step through the resources to access /Properties
resources = mupdf.pdf_dict_get(pdfpage.obj(), pymupdf.PDF_NAME("Resources"))

# now the Properties object:
props = resources.pdf_dict_get(pymupdf.PDF_NAME("Properties"))

# iterate the properties sub dicts (each value is an indirect object here):
for i in range(props.pdf_dict_len()):
    k = props.pdf_dict_get_key(i)
    v = props.pdf_dict_get_val(i)
    print(k.pdf_to_name(), doc.xref_object(v.pdf_to_num(), compressed=True))

    
R17 <</Page(1)/Pos/Left>>
R19 <</Page(1)/Pos/Left>>
R13 <</Page(1)/Pos/Left>>

Of course you could continue and cleanly extract the /Page and /Pos values, as opposed to retrieving the object's string as I did.
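If you would rather stay with the printed strings, a low-tech option is to parse them in plain Python. The sketch below is only an assumption about what is good enough for this file: parse_flat_pdf_dict is a made-up helper that handles flat dictionaries whose values are literal strings, names or plain numbers. It does not handle nested dictionaries, arrays, indirect references or escaped characters inside strings.

```python
import re

# key, followed by a (string), a /Name, or a number
_PAIR = re.compile(
    r"/([^\s/<>\[\]()]+)\s*"
    r"(\((?:[^()\\]|\\.)*\)|/[^\s/<>\[\]()]+|[-+]?(?:\d+\.?\d*|\.\d+))"
)


def parse_flat_pdf_dict(text: str) -> dict:
    """Turn a string like '<</Page(1)/Pos/Left>>' into a Python dict."""
    body = text.strip()
    if not (body.startswith("<<") and body.endswith(">>")):
        raise ValueError("not a PDF dictionary: %r" % text)
    result = {}
    for key, value in _PAIR.findall(body[2:-2]):
        if value.startswith("("):
            result[key] = value[1:-1]    # literal string, parentheses removed
        elif value.startswith("/"):
            result[key] = value[1:]      # name, leading slash removed
        else:
            result[key] = float(value)   # plain number
    return result


print(parse_flat_pdf_dict("<</Page(1)/Pos/Left>>"))
```

The same function also accepts the uncompressed, multi-line form of the object shown in the question.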
