On the Mac OS X platform, I would like to write a script, either in Python or Tcl to search for text within a PDF file and extract the relevant parts. I appreciate any help.
I am writing scripts to look inside a PDF to determine if it is a bill, from what company, and for what period. Based on these information, I rename the PDF and move it to an appropriate directory. For example, file such as Statement_03948293929384.pdf
might become 2012-07-15 Water Bill.pdf
and moved to my Utilities
folder.
pdf-parser.py
by Didier StevensI have found a command-line tool called pdftotext written by Glyph & Cog, LLC; built and packaged by Carsten Bluem. This tool is straight forward and it solves my problem. I am still looking out for those tools that can search PDF directly, without having to convert to text file.
I have successfully used PyODConverter to convert to/from PDFs (there is also a more powerful Java version). Once you have the PDF converted to text it should be trivial to do the searching. Also I believe iText should be capable of doing similar things, but I haven't tested it.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With