
How to extract PDF fields from a filled out form in Python?

Tags:

python

forms

pdf

I'm trying to use Python to process some PDF forms that were filled out and signed using Adobe Acrobat Reader.

I've tried:

  • The pdfminer demo: it didn't dump any of the filled out data.
  • pyPdf: it maxed a core for 2 minutes when I tried to load the file with PdfFileReader(f) and I just gave up and killed it.
  • Jython and PDFBox: got that working great, but the startup time is excessive; I'll just write an external utility in straight Java if that's my only option.

I can keep hunting for libraries and trying them but I'm hoping someone already has an efficient solution for this.


Update: Based on Steven's answer I looked into pdfminer and it did the trick nicely.

    from argparse import ArgumentParser
    import pickle
    import pprint
    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdftypes import resolve1, PDFObjRef

    def load_form(filename):
        """Load pdf form contents into a nested list of name/value tuples"""
        with open(filename, 'rb') as file:
            parser = PDFParser(file)
            doc = PDFDocument(parser)
            return [load_fields(resolve1(f)) for f in
                       resolve1(doc.catalog['AcroForm'])['Fields']]

    def load_fields(field):
        """Recursively load form fields"""
        form = field.get('Kids', None)
        if form:
            return [load_fields(resolve1(f)) for f in form]
        else:
            # Some field types, like signatures, need extra resolving
            return (field.get('T').decode('utf-16'), resolve1(field.get('V')))

    def parse_cli():
        """Load command line arguments"""
        parser = ArgumentParser(description='Dump the form contents of a PDF.')
        parser.add_argument('file', metavar='pdf_form',
                        help='PDF Form to dump the contents of')
        parser.add_argument('-o', '--out', help='Write output to file',
                          default=None, metavar='FILE')
        parser.add_argument('-p', '--pickle', action='store_true', default=False,
                          help='Format output for python consumption')
        return parser.parse_args()

    def main():
        args = parse_cli()
        form = load_form(args.file)
        if args.out:
            # Pickle output is binary, so open the file in the matching mode
            with open(args.out, 'wb' if args.pickle else 'w') as outfile:
                if args.pickle:
                    pickle.dump(form, outfile)
                else:
                    pp = pprint.PrettyPrinter(indent=2)
                    # Write to the opened output file (was: file.write)
                    outfile.write(pp.pformat(form))
        else:
            if args.pickle:
                print(pickle.dumps(form))
            else:
                pp = pprint.PrettyPrinter(indent=2)
                pp.pprint(form)

    if __name__ == '__main__':
        main()
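
One caveat about `field.get('T').decode('utf-16')`: a PDF text string is UTF-16 only when it starts with the byte-order mark `FE FF`. Otherwise it uses PDFDocEncoding, and a plain name such as `b'Name'` would be decoded incorrectly. Below is a minimal sketch of a more tolerant decoder. The helper name `decode_pdf_text` is hypothetical, and Latin-1 is used only as an approximation of PDFDocEncoding (the two agree on printable ASCII and most accented characters, but not on every byte).

```python
import codecs


def decode_pdf_text(raw):
    """Decode a PDF text string (hypothetical helper, not part of pdfminer)."""
    if isinstance(raw, str):
        # Already decoded by the library; nothing to do
        return raw
    if raw.startswith(codecs.BOM_UTF16_BE):
        # UTF-16BE text string: drop the 2-byte BOM, then decode
        return raw[2:].decode('utf-16-be')
    # Assumption: treat everything else as Latin-1, which only
    # approximates PDFDocEncoding
    return raw.decode('latin-1')


print(decode_pdf_text(b'\xfe\xff\x00N\x00a\x00m\x00e'))
print(decode_pdf_text(b'Name'))
```

In `load_fields`, this could stand in for the `.decode('utf-16')` call if some field names turn out not to be UTF-16.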
Olson asked Oct 21 '10 03:10




2 Answers

You should be able to do it with pdfminer, but it will require some delving into pdfminer's internals and some knowledge of the PDF format (forms, of course, but also PDF's internal structures such as "dictionaries" and "indirect objects").

This example might help you on your way (I think it will work only in simple cases, with no nested fields, etc.):

    import sys
    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdftypes import resolve1

    filename = sys.argv[1]

    with open(filename, 'rb') as fp:
        parser = PDFParser(fp)
        doc = PDFDocument(parser)
        # The AcroForm entry may be an indirect object, hence resolve1
        fields = resolve1(doc.catalog['AcroForm'])['Fields']
        for i in fields:
            field = resolve1(i)
            name, value = field.get('T'), field.get('V')
            print('{0}: {1}'.format(name, value))

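EDIT: forgot to mention: if you need to provide a password, pass it to doc.initialize(). (In more recent releases such as pdfminer.six, the password is instead passed to the constructor, e.g. PDFDocument(parser, password='secret').)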
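
For forms with nested fields, a child's full name is its ancestors' partial names joined with periods. The sketch below walks such a tree and yields one `(qualified_name, value)` pair per terminal field. It is an illustration under stated assumptions, not pdfminer API: `walk_fields` and its `resolve` parameter are hypothetical names, names are decoded as Latin-1 for simplicity, and the sample data is made up. With a real document you would pass pdfminer's `resolve1` as `resolve` so that indirect objects get dereferenced.

```python
def walk_fields(fields, resolve=lambda obj: obj, prefix=''):
    """Yield (qualified_name, value) for every terminal form field."""
    for ref in fields:
        field = resolve(ref)
        name = field.get('T')
        if isinstance(name, bytes):
            # Assumption: Latin-1 is good enough for the partial name
            name = name.decode('latin-1')
        full = '.'.join(part for part in (prefix, name) if part)
        kids = [resolve(k) for k in (resolve(field.get('Kids')) or [])]
        # Kids without a 'T' entry are widget annotations, not child fields
        children = [k for k in kids if 'T' in k]
        if children:
            yield from walk_fields(children, resolve, full)
        else:
            yield full, resolve(field.get('V'))


# Made-up data standing in for already-resolved PDF dictionaries
form = [
    {'T': b'applicant', 'Kids': [
        {'T': b'first', 'V': b'Ada'},
        {'T': b'last', 'V': b'Lovelace'},
    ]},
    {'T': b'signed', 'V': b'Yes', 'Kids': [{'Subtype': 'Widget'}]},
]
result = dict(walk_fields(form))
print(result)
```

Against a parsed document, the call would look something like `dict(walk_fields(fields, resolve1))`, where `fields` is the `'Fields'` array from the example above.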

Steven answered Oct 02 '22 17:10


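Python 3.6+ (this uses the legacy PdfFileReader class, which PyPDF2 3.0 removed in favor of PdfReader, so an older PyPDF2 release is needed):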

pip install PyPDF2

    # -*- coding: utf-8 -*-

    from collections import OrderedDict
    from PyPDF2 import PdfFileWriter, PdfFileReader


    def _getFields(obj, tree=None, retval=None, fileobj=None):
        """
        Extracts field data if this PDF contains interactive form fields.
        The *tree* and *retval* parameters are for recursive use.

        :param fileobj: A file object (usually a text file) to write
            a report to on all interactive form fields found.
        :return: A dictionary where each key is a field name, and each
            value is a :class:`Field<PyPDF2.generic.Field>` object. By
            default, the mapping name is used for keys.
        :rtype: dict, or ``None`` if form data could not be located.
        """
        fieldAttributes = {'/FT': 'Field Type', '/Parent': 'Parent', '/T': 'Field Name', '/TU': 'Alternate Field Name',
                           '/TM': 'Mapping Name', '/Ff': 'Field Flags', '/V': 'Value', '/DV': 'Default Value'}
        if retval is None:
            retval = OrderedDict()
            catalog = obj.trailer["/Root"]
            # get the AcroForm tree
            if "/AcroForm" in catalog:
                tree = catalog["/AcroForm"]
            else:
                return None
        if tree is None:
            return retval

        obj._checkKids(tree, retval, fileobj)
        for attr in fieldAttributes:
            if attr in tree:
                # Tree is a field
                obj._buildField(tree, retval, fileobj, fieldAttributes)
                break

        if "/Fields" in tree:
            fields = tree["/Fields"]
            for f in fields:
                field = f.getObject()
                obj._buildField(field, retval, fileobj, fieldAttributes)

        return retval


    def get_form_fields(infile):
        infile = PdfFileReader(open(infile, 'rb'))
        fields = _getFields(infile)
        return OrderedDict((k, v.get('/V', '')) for k, v in fields.items())


    if __name__ == '__main__':
        from pprint import pprint

        pdf_file_name = 'FormExample.pdf'

        pprint(get_form_fields(pdf_file_name))
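
The mapping returned by `get_form_fields` holds raw PDF values, so a checkbox typically shows up as a name such as `/Yes` or `/Off` and an unfilled field as an empty string. Here is a rough sketch of turning such a mapping into JSON-friendly values. The helper `to_plain` is hypothetical, the sample dictionary is made up, and treating `/Yes` as the only "checked" state is an assumption: forms can define other on-state names, which this sketch passes through unchanged.

```python
import json


def to_plain(value):
    """Convert one raw field value to a JSON-friendly value (sketch)."""
    if value is None or value == '':
        # Unfilled field
        return None
    text = str(value)
    if text == '/Off':
        return False
    if text == '/Yes':
        # Assumption: the checkbox's on-state is named /Yes
        return True
    return text


# Made-up values shaped like the output of get_form_fields()
fields = {'name': 'Ada', 'subscribe': '/Yes', 'newsletter': '/Off', 'notes': ''}
plain = {key: to_plain(value) for key, value in fields.items()}
print(json.dumps(plain, indent=2, sort_keys=True))
```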
dvska answered Oct 02 '22 18:10