
Extracting PDF annotations/comments [duplicate]

Tags: java, python, pdf

We have a fairly complex print workflow in which the controlling department adds comments and annotations to draft versions of generated PDF documents using Adobe Reader or Adobe Acrobat. As part of the workflow, imported PDF documents with annotations and comments should be parsed, and the annotations should be imported into a CMS (together with the PDF).

Q: Are there any reliable tools (preferably Python or Java) for extracting such data from PDF files in a clean and reliable way?


1 Answer

This code should do the job. One of the answers to the question "Parse annotations from a pdf" was very helpful in writing the code below. It uses the poppler library to parse the annotations. The sample input file is annotations.pdf.

code

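import os.path

import poppler  # provided by the python-poppler package

# poppler expects a file URI rather than a plain path
path = 'file://%s' % os.path.realpath('annotations.pdf')
doc = poppler.document_new_from_file(path, None)  # None: no password
pages = [doc.get_page(i) for i in range(doc.get_n_pages())]

for page_no, page in enumerate(pages):
    # collect the text of every annotation on the page, skipping empty ones
    items = [i.annot.get_contents() for i in page.get_annot_mapping()]
    items = [i for i in items if i]
    print("page: %s comments: %s " % (page_no + 1, items))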

output

page: 1 comments: ['This is an annotation'] 
page: 2 comments: [' Please note ', ' Please note ', 'This is a comment in the text'] 
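
For the CMS import mentioned in the question, the per-page lists usually need to be cleaned and serialized first. The following is only a minimal sketch, assuming the comments have already been collected as a mapping from page number to a list of strings. The function name comments_to_records and the record fields "page" and "text" are hypothetical; they are not part of poppler or of any particular CMS.

```python
import json


def comments_to_records(comments_by_page):
    """Flatten {page_no: [comment, ...]} into a list of cleaned records.

    The field names "page" and "text" are illustrative assumptions.
    """
    records = []
    for page_no in sorted(comments_by_page):
        for text in comments_by_page[page_no]:
            cleaned = text.strip()  # drop padding such as ' Please note '
            if cleaned:  # skip annotations without any text
                records.append({"page": page_no, "text": cleaned})
    return records


# Sample data mirroring the output shown above.
sample = {
    1: ["This is an annotation"],
    2: [" Please note ", " Please note ", "This is a comment in the text"],
}

records = comments_to_records(sample)
print(json.dumps(records, indent=2))
```

Repeated comments are kept on purpose, since the two "Please note" entries may belong to separate annotations on the page.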

installation

On Ubuntu the installation is as follows.

apt-get install python-poppler
Marwan Alsabbagh answered Oct 26 '25 07:10
