I am trying to extract text from PDF files using Perl. I have been calling pdftotext.exe from the command line (i.e. via Perl's system function) to extract the text, and this method works fine.
The problem is that the PDF files contain symbols such as α and β and other special characters, and these do not appear in the generated text file. A few extra spaces are also being added at random places in the text.
Is there a better and more reliable way to extract text from PDF files, such that the output includes all the symbols (α, β, etc.) and exactly matches the text in the PDF (i.e. without extra spaces)?
To extract information from a PDF in Acrobat DC, choose Tools > Export PDF and select an option. To extract text, export the PDF to a Word format or rich text format, and choose from several advanced options that include: Retain Flowing Text.
In the Actions palette, double-click or drag the Extract text action from the PDF package. In the PDF path, select one of the following options to specify the location of the PDF: Control Room file: Enables you to select a PDF file that is available in a folder in the Control Room.
You can extract text from a PDF with these modules:
PDF::API2
CAM::PDF
CAM::PDF::PageText
From CPAN (CAM::PDF::PageText):
    use CAM::PDF;
    use CAM::PDF::PageText;

    my $pdf = CAM::PDF->new($filename);
    my $pageone_tree = $pdf->getPageContentTree(1);
    print CAM::PDF::PageText->render($pageone_tree);
This module attempts to extract sequential text from a PDF page. This is not a robust process, as PDF text is graphically laid out in arbitrary order. This module uses a few heuristics to try to guess what text goes next to what other text, but may be fooled easily by, say, subscripts, non-horizontal text, changes in font, form fields etc.
All those disclaimers aside, it is useful for a quick dump of text from a simple PDF file.
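If you would rather keep pdftotext, the missing α and β may simply be an output-encoding issue: depending on the build, pdftotext may default to an encoding such as Latin1 that cannot represent Greek letters. The sketch below is only an illustration under stated assumptions: it assumes pdftotext is on the PATH, that the PDF's fonts carry usable Unicode mappings, and that your Perl supports list-form pipe opens (on Windows this needs Perl 5.22 or later). The helper names pdftotext_args and extract_text are made up for this example; the flags -enc, -layout, -raw and -nopgbrk are standard pdftotext options.

```perl
use strict;
use warnings;

# Build the pdftotext argument list (hypothetical helper, not part of any module).
# -enc UTF-8 : write UTF-8 so characters such as α and β can be represented
# -layout    : try to keep the physical layout; -raw keeps content-stream order
# -nopgbrk   : do not emit form-feed characters between pages
sub pdftotext_args {
    my ($pdf_path, %opt) = @_;
    my @args = ('-enc', 'UTF-8');
    push @args, ($opt{raw} ? '-raw' : '-layout');
    push @args, '-nopgbrk';
    return (@args, $pdf_path, '-');    # '-' sends the text to stdout
}

# Run pdftotext and return its output as a decoded Perl string.
sub extract_text {
    my ($pdf_path, %opt) = @_;
    my @cmd = ('pdftotext', pdftotext_args($pdf_path, %opt));
    open(my $fh, '-|', @cmd) or die "Cannot run pdftotext: $!";
    binmode($fh, ':encoding(UTF-8)');  # decode the UTF-8 bytes from pdftotext
    local $/;                          # slurp the whole output at once
    my $text = <$fh>;
    close($fh) or die "pdftotext exited with status $?";
    return $text;
}

# Usage: perl extract.pl file.pdf
if (@ARGV) {
    binmode(STDOUT, ':encoding(UTF-8)');
    print extract_text($ARGV[0]);
}
```

Spacing differs between the -layout and -raw modes, so it is worth trying both (pass raw => 1 to extract_text) to see which one is closer to the original text. Neither mode is guaranteed to reproduce the PDF exactly, for the same reasons given in the disclaimers above.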