
How can I extract text from a PDF file in Perl?

I am trying to extract text from PDF files using Perl. So far I have been running pdftotext.exe from the command line (i.e., through Perl's system function), and this works fine for plain text.

The problem is that the PDF files contain symbols such as α and β and other special characters, which do not appear in the generated .txt file. A few extra spaces are also inserted at random places in the text.

Is there a better, more reliable way to extract text from PDF files, so that the output includes symbols such as α and β and exactly matches the text in the PDF (i.e., without extra spaces)?

Pawan Rao asked Jul 16 '09 11:07


1 Answer

You can extract text from a PDF with these modules:

PDF::API2

CAM::PDF

CAM::PDF::PageText

From the CAM::PDF::PageText documentation on CPAN:

   my $pdf = CAM::PDF->new($filename);
   my $pageone_tree = $pdf->getPageContentTree(1);
   print CAM::PDF::PageText->render($pageone_tree);

This module attempts to extract sequential text from a PDF page. This is not a robust process, as PDF text is graphically laid out in arbitrary order. This module uses a few heuristics to try to guess what text goes next to what other text, but may be fooled easily by, say, subscripts, non-horizontal text, changes in font, form fields etc.

All those disclaimers aside, it is useful for a quick dump of text from a simple PDF file.
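
As a rough sketch of how the snippet above could be extended to a whole document, the script below loops over every page with numPages and getPageContentTree from CAM::PDF and then collapses runs of spaces. The helper names normalize_spaces and pdf_to_text are made up for this example and are not part of any module. The whitespace cleanup is a simple heuristic that assumes extra spaces carry no meaning in your text, and it makes no promise about symbols such as α and β, which depend on how the PDF encodes its fonts.

```perl
use strict;
use warnings;

# Hypothetical helper: collapse runs of spaces/tabs into one space and
# strip leading and trailing spaces on every line.
sub normalize_spaces {
    my ($text) = @_;
    $text =~ s/[ \t]+/ /g;   # runs of blanks become a single space
    $text =~ s/ +$//mg;      # drop trailing spaces per line
    $text =~ s/^ +//mg;      # drop leading spaces per line
    return $text;
}

# Hypothetical helper: render every page of a PDF with CAM::PDF::PageText.
sub pdf_to_text {
    my ($filename) = @_;

    # Loaded at run time so the cleanup helper works without the modules.
    require CAM::PDF;
    require CAM::PDF::PageText;

    my $pdf = CAM::PDF->new($filename)
        or die "Cannot open $filename: $CAM::PDF::errstr\n";

    my @pages;
    for my $page_number (1 .. $pdf->numPages()) {
        my $tree = $pdf->getPageContentTree($page_number);
        next unless $tree;   # skip pages whose content could not be parsed
        push @pages, CAM::PDF::PageText->render($tree);
    }
    return normalize_spaces(join "\n", @pages);
}

# Usage: perl extract.pl file.pdf > file.txt
if (@ARGV) {
    print pdf_to_text($ARGV[0]);
}
```

Keeping the cleanup in its own function makes it easy to apply the same pass to the pdftotext output and compare the two results on the same file.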

joe answered Oct 23 '22 23:10