
How can I do a full-text search of PDF files from Perl?

I have a bunch of PDF files and my Perl program needs to do a full-text search of them to return which ones contain a specific string. To date I have been using this:

my @search_results = `grep -i -l \"$string\" *.pdf`;

where $string is the text to look for. However, this fails for most PDFs because PDF is not a plain-text format: page content is typically compressed or encoded, so grep never sees the text.

What is the easiest way to do this?

Clarification: There are about 300 PDFs whose names I do not know in advance. PDF::Core is probably overkill. I am trying to get pdftotext and grep to play nicely with each other, but since I don't know the names of the PDFs, I haven't found the right syntax yet.

Final solution using Adam Bellaire's suggestion below:

@search_results = `for i in *.pdf; do pdftotext "\$i" - | grep --label="\$i" -i -l "$search_string"; done`;
aurelien asked Sep 26 '08 12:09



2 Answers

The PerlMonks thread here talks about this problem.

For your situation, it might be simplest to get pdftotext (the command-line tool). Then you can do something like:

my @search_results = `pdftotext myfile.pdf - | grep -i -l \"$string\"`;
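For many files whose names are not known in advance, the same idea can be sketched in Perl without going through the shell. This is only an illustration, assuming pdftotext is installed and on the PATH; the names pdftotext_extract and search_pdfs are hypothetical helpers, not part of any library. The list form of open() avoids shell quoting problems, and \Q...\E makes the search string match literally.

```perl
use strict;
use warnings;

# Run pdftotext on one file and return its text, or undef on failure.
# The list form of open() bypasses the shell, so file names containing
# spaces or quotes need no escaping.
sub pdftotext_extract {
    my ($file) = @_;
    open(my $fh, '-|', 'pdftotext', $file, '-') or return;
    local $/;                 # slurp the whole output at once
    my $text = <$fh>;
    close($fh) or return;     # false if pdftotext exited non-zero
    return $text;
}

# Return the files whose extracted text contains $string, ignoring case.
# $extract is a code ref mapping a file name to its text (a hypothetical
# hook, so the matching logic can be exercised without real PDFs).
sub search_pdfs {
    my ($string, $extract, @files) = @_;
    my @hits;
    for my $file (@files) {
        my $text = $extract->($file);
        next unless defined $text;          # skip files that failed to convert
        push @hits, $file if $text =~ /\Q$string\E/i;
    }
    return @hits;
}

# Usage:
# my @search_results = search_pdfs($string, \&pdftotext_extract, glob('*.pdf'));
```

Because matching happens on the whole extracted text rather than line by line, a phrase that pdftotext breaks across two lines will still be missed unless the newlines are normalised first.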
Adam Bellaire answered Oct 01 '22 06:10


My library, CAM::PDF, has support for extracting text, but it's an inherently hard problem given the graphical orientation of PDF syntax. So, the output is sometimes gibberish. CAM::PDF bundles a getpdftext.pl program, or you can invoke the functionality like so:

use CAM::PDF;

my $doc = CAM::PDF->new($filename) || die "$CAM::PDF::errstr\n";
for my $pagenum (1 .. $doc->numPages()) {
   my $text = $doc->getPageText($pagenum);
   print $text;
}
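To turn the page loop above into a search, one possible sketch follows. It assumes CAM::PDF is installed from CPAN; matching_pages and search_with_cam_pdf are hypothetical helper names, and only new(), numPages() and getPageText() are taken from the snippet above. Since the extracted text is sometimes gibberish, a missing hit does not prove the string is absent from the PDF.

```perl
use strict;
use warnings;

# Return the page numbers of $doc whose extracted text contains $string,
# ignoring case. $doc is any object offering numPages() and
# getPageText(), as a CAM::PDF document does.
sub matching_pages {
    my ($doc, $string) = @_;
    my @pages;
    for my $pagenum (1 .. $doc->numPages()) {
        my $text = $doc->getPageText($pagenum);
        next unless defined $text;          # page with no extractable text
        push @pages, $pagenum if $text =~ /\Q$string\E/i;
    }
    return @pages;
}

# Return the files in @files that have at least one matching page.
sub search_with_cam_pdf {
    my ($string, @files) = @_;
    require CAM::PDF;                       # loaded lazily, only when needed
    my @hits;
    for my $file (@files) {
        my $doc = CAM::PDF->new($file);
        if (!$doc) {
            warn "Skipping $file: CAM::PDF could not open it\n";
            next;
        }
        # In boolean context matching_pages() yields the number of hits.
        push @hits, $file if matching_pages($doc, $string);
    }
    return @hits;
}

# Usage:
# my @search_results = search_with_cam_pdf($string, glob('*.pdf'));
```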
Chris Dolan answered Oct 01 '22 07:10