What is the best way to extract text from a pdf?
The CAM::PDF module is pretty useful for extracting text and maintaining some information about where it came from in the document. It installs /usr/local/bin/getpdftext.pl which demonstrates simple extraction. However, CAM::PDF can only read PDFs that are completely valid.
If you are dealing with ill-formed PDFs, you may need a more lenient parser, such as pdftotext. It dumps foo.pdf to foo.txt, which you could then read into Perl.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With