PDF Form Field Manipulation

Tags: python, pdf, django

I'm making a web interface to autofill pdf forms with user data from a database. The admin needs to be able to upload a pdf (right now targeted at IRS pdf forms) and then associate the fields in the pdf with data fields in the database.

I need a way to help the admin associate the field names (stuff like "topmostSubform[0].Page2[0].p2-t66[0]") with the data fields in the database. I'm looking for a way to modify the PDF programmatically so that it provides this information.

Basically, I'm open to suggestions on how I might make the field names appear in an obvious manner on a modified version of the original PDF. The closest I've gotten is inserting tooltips into the fields by editing the raw PDF line by line. However, when editing the PDF this way, the field names appear as gibberish, so I can't just use them.

An optimal solution would be anything that could automatically parse a PDF and set each field's tooltip to the field's name. Anything that can be run from the command line, any Python tool, or just a basic how-to on correctly parsing a field's name from a raw PDF file would be amazing.

John asked Apr 06 '10 21:04



2 Answers

There may be an easier solution than this, but you could definitely get the job done with ReportLab (http://www.reportlab.com/software/opensource/rl-toolkit/).

If you can save the current tax forms as an image, you could determine where each of the items need to be written and develop your code so that it automatically layers the appropriate values from the database on top of the image (the tax form, or whatever it might be).

Once you've determined 1) what fields need to be pulled from the database, and 2) where they're supposed to go within the form...

this is essentially what you'd be doing:

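from reportlab.pdfgen import canvas

# Placeholder values: an image of the blank form and where to draw it.
background_image = 'form.png'
x_pos, y_pos = 0, 0  # in points (1/72 inch) from the bottom-left origin

# One (text, x, y) tuple per value pulled from the database.
report_string_values = [('Alex', 500, 500), ('Guido', 400, 400)]

c = canvas.Canvas('hello.pdf')
c.drawImage(background_image, x_pos, y_pos)
for text, x, y in report_string_values:
    c.drawString(x, y, text)  # draw each value on top of the form image
c.showPage()
c.save()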
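
If the field positions were measured in pixels on the form image (origin at the top left), they would need converting to ReportLab's coordinates, which are in points (1/72 inch) from the bottom-left corner of the page. A minimal sketch of that conversion, assuming the image was rendered at a known DPI; the function name and the 150 DPI figure are illustrative and not part of the approach above:

```python
def pixel_to_pdf_point(x_px, y_px, dpi, page_height_pt):
    """Convert a top-left-origin pixel position to bottom-left-origin PDF points."""
    x_pt = x_px * 72.0 / dpi                   # 72 points per inch
    y_pt = page_height_pt - y_px * 72.0 / dpi  # flip the vertical axis
    return x_pt, y_pt


# US Letter is 612 x 792 points; 150 DPI is an assumed scan resolution.
x, y = pixel_to_pdf_point(300, 150, dpi=150, page_height_pt=792)
```

The resulting `x` and `y` could then be passed straight to `c.drawString(x, y, text)`.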
damzam answered Oct 20 '22 21:10


This may be way off your intended track, but it might be worth a think. I've been working on parsing scanned structured documents into Django model instances. Using tesseract and unpaper for the pre-processing and OCR, I get over 99% accuracy. That lets me parse the OCR output text with the Levenshtein and re modules and then do a simple new_instance = MyModel(parsed1, parsed2, ...).
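
As a rough illustration of the fuzzy-matching step, here is a sketch that maps noisy OCR text onto a list of expected labels. It swaps in difflib from the standard library in place of the Levenshtein module so it runs without extra packages; the label list, function names, and cutoff are all hypothetical:

```python
import difflib
import re

# Hypothetical labels that the form is expected to contain.
KNOWN_LABELS = ["First name", "Last name", "Social security number"]


def normalize(text):
    """Lowercase the text and collapse non-alphanumeric runs into single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def match_label(ocr_text, labels=KNOWN_LABELS, cutoff=0.6):
    """Return the known label most similar to the OCR text, or None if none is close."""
    by_normalized = {normalize(label): label for label in labels}
    hits = difflib.get_close_matches(
        normalize(ocr_text), list(by_normalized), n=1, cutoff=cutoff
    )
    return by_normalized[hits[0]] if hits else None
```

The cutoff trades false matches against missed ones, so it would need tuning against real OCR output.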

It seems that you are trying to do something similar. Looking at the forms at http://www.irs.gov/formspubs/, they tend to have text labels left-adjacent to the fields. Using something like py-tesseract, you should be able to OCR the labels, overlay the OCR text on the form image, and allow the user to select/edit the field labels.
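
One way to suggest a label for each field, given that labels tend to sit directly to the left, is to pick the nearest label box on the same row. This is only a sketch: it assumes you already have bounding boxes for the OCR'd labels and for the fields as (x0, y0, x1, y1) tuples in the same coordinate system, and the function name is made up for illustration:

```python
def nearest_left_label(field_box, label_boxes):
    """Return the text of the closest label to the left of field_box on the same row.

    Boxes are (x0, y0, x1, y1) tuples; label_boxes maps label text to its box.
    Returns None when no label qualifies.
    """
    fx0, fy0, fx1, fy1 = field_box
    best_text, best_gap = None, None
    for text, (lx0, ly0, lx1, ly1) in label_boxes.items():
        overlaps_vertically = ly0 < fy1 and fy0 < ly1
        gap = fx0 - lx1  # horizontal distance from label's right edge to field's left edge
        if overlaps_vertically and gap >= 0 and (best_gap is None or gap < best_gap):
            best_text, best_gap = text, gap
    return best_text


# Made-up boxes for illustration.
labels = {
    "Name": (10, 100, 60, 115),
    "City": (10, 130, 60, 145),
    "Zip": (200, 130, 230, 145),
}
suggested = nearest_left_label((240, 128, 300, 147), labels)
```

The suggestion would only be a starting point; the user would still confirm or edit it.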

There is a nice little tool, OCRFeeder (https://live.gnome.org/OCRFeeder), that is written in Python and should give you a basic idea of how the process works in a desktop app. Good luck.

justinzane answered Oct 20 '22 21:10