How can I extract text from a PDF file in Perl?

I am trying to extract text from PDF files using Perl. So far I have been running pdftotext.exe from the command line (i.e. through Perl's system function), and this works for plain text.

The problem is that the PDF files contain symbols such as α and β and other special characters, and these do not appear in the generated txt file. A few extra spaces are also inserted at seemingly random places in the text.

Is there a better, more reliable way to extract text from PDF files, so that the output includes all symbols such as α and β and exactly matches the text in the PDF (i.e. without extra spaces)?

asked Jul 16 '09 11:07

Pawan Rao

1 Answer

You can extract text from a PDF with these modules:

PDF::API2

CAM::PDF

CAM::PDF::PageText

From the CPAN documentation for CAM::PDF::PageText:

   my $pdf = CAM::PDF->new($filename);
   my $pageone_tree = $pdf->getPageContentTree(1);
   print CAM::PDF::PageText->render($pageone_tree);

This module attempts to extract sequential text from a PDF page. This is not a robust process, as PDF text is graphically laid out in arbitrary order. This module uses a few heuristics to try to guess what text goes next to what other text, but may be fooled easily by, say, subscripts, non-horizontal text, changes in font, form fields etc.

All those disclaimers aside, it is useful for a quick dump of text from a simple PDF file.
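The CPAN snippet above only handles page 1. A minimal sketch of extending it to every page follows; `extract_pdf_text` is a hypothetical helper name (not part of CAM::PDF), and joining pages with a form feed is just one possible choice. It relies on the same heuristics, so it is not guaranteed to preserve symbols such as α and β or the exact spacing of the original.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use CAM::PDF;
use CAM::PDF::PageText;

# Hypothetical helper (not part of CAM::PDF): returns the text of every
# page in the file, with a form feed between pages.
sub extract_pdf_text {
    my ($filename) = @_;

    # new() returns undef on failure and sets $CAM::PDF::errstr
    my $pdf = CAM::PDF->new($filename)
        or die "Cannot open '$filename': $CAM::PDF::errstr\n";

    my @pages;
    for my $pagenum (1 .. $pdf->numPages()) {
        my $tree = $pdf->getPageContentTree($pagenum);
        next unless $tree;    # skip pages whose content could not be parsed
        push @pages, CAM::PDF::PageText->render($tree);
    }
    return join "\f", @pages;
}

# Run only when executed as a script, e.g.: perl extract.pl input.pdf
unless (caller) {
    my $file = shift @ARGV or die "Usage: $0 file.pdf\n";
    print extract_pdf_text($file), "\n";
}
```

Keeping the loop in a function makes it easy to call from a larger script and to compare its output page by page with what pdftotext produces for the same file.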

answered Oct 23 '22 23:10

joe
