
What is the best Perl module to extract text from a PDF? [closed]

What is the best way to extract text from a PDF?

user_78361084 asked Jan 19 '11 00:01



1 Answer

The CAM::PDF module is useful for extracting text while keeping some information about where in the document it came from. It installs /usr/local/bin/getpdftext.pl, which demonstrates simple extraction. However, CAM::PDF can only read PDFs that are completely valid.
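
As a rough illustration of the CAM::PDF approach, here is a minimal sketch that pulls the text out page by page, so each chunk stays tied to its page number. The helper name extract_pages is made up for this example, and it assumes CAM::PDF is installed and that the input is a valid PDF:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use CAM::PDF;

# Hypothetical helper: return a list of [page_number, text] pairs.
sub extract_pages {
    my ($file) = @_;

    # new() returns undef if the file cannot be opened or parsed.
    my $pdf = CAM::PDF->new($file)
        or die "Cannot parse $file: $CAM::PDF::errstr\n";

    my @pages;
    for my $n (1 .. $pdf->numPages()) {
        my $text = $pdf->getPageText($n);
        # Store an empty string for pages where no text was returned.
        push @pages, [ $n, defined $text ? $text : '' ];
    }
    return @pages;
}

# Only do work when a file name is given on the command line.
if (@ARGV) {
    for my $page (extract_pages($ARGV[0])) {
        my ($n, $text) = @$page;
        print "--- page $n ---\n$text\n";
    }
}
```

Saved under a name of your choosing, it would be run as `perl extract.pl foo.pdf`. Keeping the page number next to each chunk is only one simple way of tracking where the text came from.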

If you are dealing with ill-formed PDFs, you may need a more lenient parser, such as the pdftotext command-line tool. By default it dumps foo.pdf to foo.txt, which you can then read into Perl.
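
If you would rather skip the intermediate foo.txt, something along these lines should work. It is a sketch that assumes pdftotext is on your PATH and a Unix-like system, and it relies on pdftotext writing to standard output when the output file is given as `-`. The sub name pdftotext_string is invented for this example:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: run pdftotext and return its output as one string.
sub pdftotext_string {
    my ($file) = @_;

    # List-form pipe open bypasses the shell, so spaces or quotes in the
    # file name need no escaping. The trailing '-' sends text to stdout.
    open(my $fh, '-|', 'pdftotext', $file, '-')
        or die "Cannot run pdftotext: $!\n";

    local $/;    # slurp mode: read everything in one go
    my $text = <$fh>;

    # close() returns false if pdftotext exited with a non-zero status.
    close($fh)
        or die "pdftotext failed on $file (exit status " . ($? >> 8) . ")\n";

    return defined $text ? $text : '';
}

# Only do work when a file name is given on the command line.
if (@ARGV) {
    print pdftotext_string($ARGV[0]);
}
```

Reading from a pipe avoids leaving stray .txt files around. If you do want the file on disk, run `pdftotext foo.pdf` and open foo.txt as usual.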

Phssthpok answered Sep 28 '22 00:09
