I'm trying to use Python to process some PDF forms that were filled out and signed using Adobe Acrobat Reader.
I've tried:
I can keep hunting for libraries and trying them but I'm hoping someone already has an efficient solution for this.
Update: Based on Steven's answer I looked into pdfminer and it did the trick nicely.
    from argparse import ArgumentParser
    import pickle
    import pprint

    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdftypes import resolve1, PDFObjRef


    def load_form(filename):
        """Load pdf form contents into a nested list of name/value tuples"""
        with open(filename, 'rb') as file:
            parser = PDFParser(file)
            doc = PDFDocument(parser)
            return [load_fields(resolve1(f)) for f in
                    resolve1(doc.catalog['AcroForm'])['Fields']]


    def load_fields(field):
        """Recursively load form fields"""
        form = field.get('Kids', None)
        if form:
            return [load_fields(resolve1(f)) for f in form]
        else:
            # Some field types, like signatures, need extra resolving
            return (field.get('T').decode('utf-16'), resolve1(field.get('V')))


    def parse_cli():
        """Load command line arguments"""
        parser = ArgumentParser(description='Dump the form contents of a PDF.')
        parser.add_argument('file', metavar='pdf_form',
                            help='PDF Form to dump the contents of')
        parser.add_argument('-o', '--out', help='Write output to file',
                            default=None, metavar='FILE')
        parser.add_argument('-p', '--pickle', action='store_true', default=False,
                            help='Format output for python consumption')
        return parser.parse_args()


    def main():
        args = parse_cli()
        form = load_form(args.file)
        if args.out:
            # Pickle output is binary; pretty-printed output is text
            with open(args.out, 'wb' if args.pickle else 'w') as outfile:
                if args.pickle:
                    pickle.dump(form, outfile)
                else:
                    pp = pprint.PrettyPrinter(indent=2)
                    outfile.write(pp.pformat(form))
        else:
            if args.pickle:
                print(pickle.dumps(form))
            else:
                pp = pprint.PrettyPrinter(indent=2)
                pp.pprint(form)


    if __name__ == '__main__':
        main()
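Since load_form returns a nested list (fields with Kids become sub-lists), it can be handy to flatten the result before further processing. The following is only a sketch: flatten_fields is a hypothetical helper, not part of pdfminer, and the sample data is hand-written in the same shape as load_form's output rather than taken from a real PDF.

```python
def flatten_fields(form):
    """Flatten the nested list from load_form into a flat list of (name, value) tuples."""
    flat = []
    for item in form:
        if isinstance(item, list):
            # A sub-list holds the children of a parent field; recurse into it
            flat.extend(flatten_fields(item))
        else:
            flat.append(item)
    return flat


# Hand-written sample mimicking the shape of load_form's output (an assumption)
sample = [('Name', b'Alice'), [('Street', b'1 Main St'), [('City', b'Springfield')]]]
flat = flatten_fields(sample)
```

A list is returned rather than a dict because the names collected by load_fields are the partial names of the leaf fields, so the same name could appear under different parents.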
In Acrobat, open the completed form file. In the right hand pane, choose More > Export Data. In the Export Form Data As dialog box, select the format in which you want to save the form data (FDF, XFDF, XML, or TXT). Then select a location and filename, and click Save.
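If the data is exported as XFDF, the resulting file is plain XML and can be read with the standard library. This is a minimal sketch, assuming the usual XFDF layout (a fields element containing field elements with a name attribute and value children, in the http://ns.adobe.com/xfdf/ namespace). The function name read_xfdf and the sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET

XFDF_NS = {'x': 'http://ns.adobe.com/xfdf/'}


def read_xfdf(xml_text):
    """Return a dict mapping dotted field names to their exported values."""
    root = ET.fromstring(xml_text)
    values = {}

    def walk(node, prefix):
        for field in node.findall('x:field', XFDF_NS):
            name = field.get('name', '')
            full = f'{prefix}.{name}' if prefix else name
            vals = [v.text or '' for v in field.findall('x:value', XFDF_NS)]
            if vals:
                # Keep a single value as a string, several values as a list
                values[full] = vals[0] if len(vals) == 1 else vals
            # Child fields are nested inside their parent field element
            walk(field, full)

    fields = root.find('x:fields', XFDF_NS)
    if fields is not None:
        walk(fields, '')
    return values


# Hand-written sample, not output from a real form
SAMPLE = """<xfdf xmlns="http://ns.adobe.com/xfdf/">
  <fields>
    <field name="name"><value>Alice</value></field>
    <field name="address">
      <field name="city"><value>Springfield</value></field>
    </field>
  </fields>
</xfdf>"""

data = read_xfdf(SAMPLE)
```

For a real export you would read the file contents first and pass the text to read_xfdf.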
All we need to do is use PyPDF2 to access the XML document from the object structure of this file. Once we have access to the XML, it is a simple exercise of parsing out the XML document to access values for various form elements, which could then be stored into a Python list, Numpy array, Pandas dataframe etc.
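The parsing step, once the XML has been pulled out of the PDF, might look something like the sketch below. It only covers that second half: the XML is assumed to be available as a string already, leaf_values is a hypothetical helper name, and the sample document is invented (real form XML is usually namespaced and more deeply nested).

```python
import xml.etree.ElementTree as ET


def leaf_values(xml_text):
    """Map each leaf element's tag (namespace stripped) to its text content."""
    root = ET.fromstring(xml_text)
    result = {}
    for elem in root.iter():
        if len(elem) == 0:  # no child elements, so this is a leaf
            tag = elem.tag.rsplit('}', 1)[-1]  # drop a '{namespace}' prefix if present
            result[tag] = (elem.text or '').strip()
    return result


# Invented sample standing in for the XML extracted from the form
SAMPLE_XML = """<form1>
  <Applicant>
    <FirstName>Alice</FirstName>
    <LastName>Smith</LastName>
  </Applicant>
  <Signed>1</Signed>
</form1>"""

values = leaf_values(SAMPLE_XML)
rows = list(values.items())  # list of (name, value) pairs, ready for further processing
```

Note that repeated tag names overwrite each other in this simple version; a form with repeating sections would need the parent path included in the key.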
Step 1: Import all libraries.
Step 2: Convert the PDF file to txt format and read the data.
Step 3: Use the re.findall() function from the regular expressions module to extract keywords.
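As a rough illustration of step 3, here is what the findall part could look like once the PDF has already been converted to text. The sample text and the "Label: value" pattern are assumptions; the real pattern depends entirely on how your converted form text is laid out.

```python
import re

# Stand-in for the text produced by step 2 (hypothetical content)
TEXT = """Name: Alice Smith
Date: 2020-01-15
Email: alice@example.com
"""

# One "Label: value" pair per line; MULTILINE makes ^ and $ match at line boundaries
PAIR = re.compile(r'^(\w+):\s*(.+)$', re.MULTILINE)

# findall returns a list of (label, value) tuples, which dict() accepts directly
pairs = dict(PAIR.findall(TEXT))
```

This approach only sees the rendered text, so it tends to be more fragile than reading the form fields directly.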
To extract fillable fields in a PDF, select a completed document as a template and click Extract in Bulk on the right pane. Define the fields with data you would like to extract. Click Add New Data Field in the upper right corner and draw a rectangle around the data you'd like to extract.
You should be able to do it with pdfminer, but it will require some delving into the internals of pdfminer and some knowledge about the pdf format (wrt forms of course, but also about pdf's internal structures like "dictionaries" and "indirect objects").
This example might help you on your way (I think it will work only on simple cases, with no nested fields etc...)
    import sys

    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdftypes import resolve1

    filename = sys.argv[1]
    with open(filename, 'rb') as fp:
        parser = PDFParser(fp)
        doc = PDFDocument(parser)
        # The AcroForm entry may be an indirect object, so resolve it first
        fields = resolve1(doc.catalog['AcroForm'])['Fields']
        for i in fields:
            field = resolve1(i)
            name, value = field.get('T'), field.get('V')
            print('{0}: {1}'.format(name, value))
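EDIT: I forgot to mention: if you need to provide a password, pass it to doc.initialize() (in versions where PDFDocument is imported from pdfminer.pdfdocument, as in the example above, pass it to the constructor instead: PDFDocument(parser, password)).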
Python 3.6+:
pip install PyPDF2
    # -*- coding: utf-8 -*-
    from collections import OrderedDict

    # Note: relies on the legacy PdfFileReader API and its private helpers
    from PyPDF2 import PdfFileReader


    def _getFields(obj, tree=None, retval=None, fileobj=None):
        """
        Extracts field data if this PDF contains interactive form fields.
        The *tree* and *retval* parameters are for recursive use.

        :param fileobj: A file object (usually a text file) to write
            a report to on all interactive form fields found.
        :return: A dictionary where each key is a field name, and each
            value is a :class:`Field<PyPDF2.generic.Field>` object. By
            default, the mapping name is used for keys.
        :rtype: dict, or ``None`` if form data could not be located.
        """
        fieldAttributes = {'/FT': 'Field Type', '/Parent': 'Parent',
                           '/T': 'Field Name', '/TU': 'Alternate Field Name',
                           '/TM': 'Mapping Name', '/Ff': 'Field Flags',
                           '/V': 'Value', '/DV': 'Default Value'}
        if retval is None:
            retval = OrderedDict()
            catalog = obj.trailer["/Root"]
            # get the AcroForm tree
            if "/AcroForm" in catalog:
                tree = catalog["/AcroForm"]
            else:
                return None
        if tree is None:
            return retval

        obj._checkKids(tree, retval, fileobj)
        for attr in fieldAttributes:
            if attr in tree:
                # Tree is a field
                obj._buildField(tree, retval, fileobj, fieldAttributes)
                break

        if "/Fields" in tree:
            fields = tree["/Fields"]
            for f in fields:
                field = f.getObject()
                obj._buildField(field, retval, fileobj, fieldAttributes)

        return retval


    def get_form_fields(infile):
        # Keep the file open while the fields are read, then close it
        with open(infile, 'rb') as f:
            reader = PdfFileReader(f)
            # _getFields returns None when the PDF has no AcroForm
            fields = _getFields(reader) or {}
            return OrderedDict((k, v.get('/V', '')) for k, v in fields.items())


    if __name__ == '__main__':
        from pprint import pprint

        pdf_file_name = 'FormExample.pdf'

        pprint(get_form_fields(pdf_file_name))