
Script to search for text from PDF

Problem

On Mac OS X, I would like to write a script, in either Python or Tcl, that searches for text within a PDF file and extracts the relevant parts. I would appreciate any help.

Background

I am writing scripts that look inside a PDF to determine whether it is a bill, which company it is from, and what period it covers. Based on this information, I rename the PDF and move it to an appropriate directory. For example, a file such as Statement_03948293929384.pdf might become 2012-07-15 Water Bill.pdf and be moved to my Utilities folder.
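The rename-and-move step described above could be sketched roughly as follows, once the text of the PDF is available. This is only an illustration: the RULES table, the MM/DD/YYYY date format, and the names plan_rename and file_bill are assumptions made for the example, not something taken from real statements.

```python
import re
import shutil
from datetime import datetime
from pathlib import Path

# Hypothetical rules: keyword found in the statement text -> (label, folder).
RULES = {
    "water": ("Water Bill", "Utilities"),
    "electric": ("Electric Bill", "Utilities"),
}

# Assumption: the statement date is printed as MM/DD/YYYY.
DATE_RE = re.compile(r"\b(\d{2}/\d{2}/\d{4})\b")


def plan_rename(text):
    """Return (new_file_name, folder_name) for a bill, or None if unrecognized."""
    match = DATE_RE.search(text)
    if match is None:
        return None
    try:
        date = datetime.strptime(match.group(1), "%m/%d/%Y").date()
    except ValueError:
        # Looked like a date but was not a valid one (e.g. 13/45/2012).
        return None
    lowered = text.lower()
    for keyword, (label, folder) in RULES.items():
        if keyword in lowered:
            return f"{date.isoformat()} {label}.pdf", folder
    return None


def file_bill(pdf_path, text, root):
    """Move pdf_path under root as planned; return the new path, or None."""
    plan = plan_rename(text)
    if plan is None:
        return None
    new_name, folder = plan
    target_dir = Path(root) / folder
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / new_name
    shutil.move(str(pdf_path), str(target))
    return target
```

Keeping the decision (plan_rename) separate from the file move (file_bill) makes it possible to try the rules on sample text before any file is touched.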

What have I done so far?

  • I have searched for PDF-to-plain-text tools, but have not found anything yet.
  • I have looked at the Tcl wiki and found an example, but could not get it to work (I searched for text in the PDF, but it was not found).
  • I am looking into pdf-parser.py by Didier Stevens.
  • I have heard of a Python package called pyPdf and will look at it next.

Update

I have found a command-line tool called pdftotext, written by Glyph & Cog, LLC and built and packaged by Carsten Bluem. This tool is straightforward and solves my problem. I am still looking for tools that can search a PDF directly, without first converting it to a text file.
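For anyone wanting to drive pdftotext from Python, here is a minimal sketch. It assumes pdftotext is on the PATH and relies on the tool's convention that passing "-" as the output file sends the text to standard output, so no intermediate text file is written. The function names are made up for this example.

```python
import re
import subprocess


def pdftotext_command(pdf_path):
    """Build the argument list; the trailing "-" means write to stdout."""
    return ["pdftotext", str(pdf_path), "-"]


def extract_text(pdf_path):
    """Run pdftotext and return the extracted text (needs pdftotext on PATH)."""
    result = subprocess.run(
        pdftotext_command(pdf_path),
        capture_output=True,
        encoding="utf-8",
        errors="replace",
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout


def matching_lines(text, pattern):
    """Return the stripped lines of text that match the regular expression."""
    regex = re.compile(pattern, re.IGNORECASE)
    return [line.strip() for line in text.splitlines() if regex.search(line)]
```

A call such as matching_lines(extract_text("Statement_03948293929384.pdf"), r"amount due") would then list the lines of interest, assuming the PDF actually contains extractable text rather than scanned images.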

Hai Vu asked Nov 13 '22 01:11


1 Answer

I have successfully used PyODConverter to convert to and from PDF (there is also a more powerful Java version). Once the PDF has been converted to text, searching it should be trivial. I also believe iText is capable of doing similar things, but I haven't tested it.
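To illustrate the searching step on already-converted text, something along these lines might do. It is a sketch, not part of any of the tools mentioned; find_with_context and its width parameter are hypothetical names chosen for the example.

```python
import re


def find_with_context(text, phrase, width=20):
    """Yield (offset, snippet) for each case-insensitive occurrence of phrase.

    Offsets refer to the whitespace-collapsed text, not the original.
    """
    # Collapse line breaks and runs of spaces left behind by the converter.
    flat = " ".join(text.split())
    for match in re.finditer(re.escape(phrase), flat, re.IGNORECASE):
        start = max(match.start() - width, 0)
        end = min(match.end() + width, len(flat))
        yield match.start(), flat[start:end]
```

Collapsing whitespace first is a design choice: converters often break a phrase such as "Billing period" across lines, and flattening the text lets a plain phrase search still find it.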

TrojanName answered Dec 14 '22 17:12