I have a PDF file containing form fields and need to export the data to an XML file AUTOMATICALLY. I created a sample form for testing.
Note: Exporting MANUALLY works great in Acrobat Professional: click Tools > Form > Export Form Data and choose the xml extension for the output file. This is the result I get when I export it manually:
<?xml version="1.0" encoding="UTF-8"?>
<fields>
<first_name>John</first_name>
<last_name>Doe</last_name>
</fields>
However, I need to automate this, e.g. with a Python script, a Java implementation, or a command-line tool. Any ideas which libraries or tools I could use to export form field data to XML? The tool or library should be open source so that I can integrate it into my workflow.
I already tried the Python pdfminer library, which helped me export the static parts of the PDF file (like "Static form header", "First name:" and "Last name:"). But how do I export the form field data (in my case the content of the form fields first_name and last_name)?
EDIT: Feel free to download the sample.pdf file here.
How about Apache PDFBox? It is open source and could fit your needs, since the website says "Extract forms data from PDF forms or prefill a PDF form."
EDIT: Check out the PrintFields example.
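Whichever library ends up reading the field values, producing XML in the shape shown in the question is the easy part. Here is a minimal Python sketch using only the standard library. It assumes the field names and values have already been extracted into a plain dict; the function name fields_to_xml is made up for illustration, and the PDF-reading step is not shown.

```python
import xml.etree.ElementTree as ET


def fields_to_xml(fields):
    """Serialize a {field_name: value} mapping as <fields>...</fields> XML.

    Assumption: every field name is a valid XML element name
    (no spaces, does not start with a digit, and so on).
    """
    root = ET.Element("fields")
    for name, value in fields.items():
        child = ET.SubElement(root, name)
        child.text = "" if value is None else str(value)
    # encoding="unicode" returns a str without an XML declaration,
    # so the declaration is prepended by hand.
    body = ET.tostring(root, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Hypothetical values, standing in for whatever the PDF library returned.
xml_text = fields_to_xml({"first_name": "John", "last_name": "Doe"})
print(xml_text)
```

The elements come out on a single line rather than one per line as in the Acrobat export, but the content is equivalent XML.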
In bash, you can do this (at least with my version of these tools, less 444 and cat 8.13):
less ~/Downloads/sample.pdf | cat
I get output that looks like this:
Static form header
First name: John
Last name: Doe
You can then parse this easily using Java, Python, awk, or whatever you prefer.
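As a rough illustration of that parsing step, here is a small Python sketch. It assumes the text output looks exactly like the sample above, with one "Label: value" pair per line. The function name parse_label_lines is made up for this example, and the keys are derived from the visible labels, which only happen to match the real field names in this form.

```python
def parse_label_lines(text):
    """Turn 'Label: value' lines into a dict, skipping lines without a colon."""
    fields = {}
    for line in text.splitlines():
        label, sep, value = line.partition(":")
        if not sep:
            continue  # no colon, e.g. the static form header
        # "First name" -> "first_name" (derived from the label, not the PDF field name)
        key = label.strip().lower().replace(" ", "_")
        fields[key] = value.strip()
    return fields


# Sample text copied from the output shown above.
sample = "Static form header\nFirst name: John\nLast name: Doe\n"
fields = parse_label_lines(sample)
print(fields)
```

This breaks as soon as a label or a static line contains a colon of its own, so treat it as a starting point only.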
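Alternatively, if you don't want to rely on the behavior of particular versions of these tools (I'm not sure whether they always do this), you can look into how less does it. On many systems the conversion is probably done by an input preprocessor such as lesspipe, configured through the LESSOPEN environment variable, rather than by less itself.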