How to extract text from a PDF? [closed]


I was given a 400-page PDF file containing a table of data that I had to import. Luckily, it had no images. Ghostscript worked for me:

gswin64c -sDEVICE=txtwrite -o output.txt input.pdf

The output file was split into pages with headers, etc., but it was then easy to write an app to strip out blank lines and headers and read in all 30,000 records. -dSIMPLE and -dCOMPLEX made no difference in this case.
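The cleanup step mentioned above could look roughly like the following Python sketch. It is only an illustration: the header pattern ("Page N" or "Page N of M") and the function names are assumptions, and the headers in a real txtwrite output will probably need a different pattern.

```python
import re

# Hypothetical header pattern: lines such as "Page 12" or "Page 12 of 400".
# Adjust this to whatever the txtwrite output of your PDF actually contains.
HEADER_RE = re.compile(r"^\s*Page \d+( of \d+)?\s*$")


def clean_lines(lines):
    """Return the lines with blank lines and page headers removed."""
    cleaned = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        if HEADER_RE.match(line):
            continue  # skip repeated page headers
        cleaned.append(line)
    return cleaned


def clean_file(path, encoding="utf-8"):
    """Read a text file produced by Ghostscript and return its cleaned lines."""
    with open(path, encoding=encoding, errors="replace") as handle:
        return clean_lines(handle)
```

Calling `clean_file("output.txt")` on the Ghostscript output would then give a list of data lines that can be split into fields and imported.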

pdftotext is an efficient command-line tool that is open source, free of charge, and available on both Linux and Windows. It is part of the Xpdf project.

http://en.wikipedia.org/wiki/Pdftotext
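If you want to drive pdftotext from a script, a thin wrapper along these lines may be enough. This is a sketch that assumes the `pdftotext` executable is on your PATH; the function names are made up for illustration, while `-layout`, `-f` and `-l` are standard pdftotext options.

```python
import subprocess


def build_pdftotext_cmd(pdf_path, txt_path, first_page=None, last_page=None, layout=True):
    """Build the argument list for a pdftotext call."""
    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")  # keep the physical layout, useful for tables
    if first_page is not None:
        cmd += ["-f", str(first_page)]  # first page to convert
    if last_page is not None:
        cmd += ["-l", str(last_page)]  # last page to convert
    cmd += [pdf_path, txt_path]
    return cmd


def run_pdftotext(pdf_path, txt_path, **options):
    """Run pdftotext; raises CalledProcessError if the tool reports failure."""
    subprocess.run(build_pdftotext_cmd(pdf_path, txt_path, **options), check=True)
```

For example, `run_pdftotext("input.pdf", "output.txt", first_page=1, last_page=10)` would convert only the first ten pages.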

As of today I know: the best tool for text extraction from PDFs is TET, the Text Extraction Toolkit. TET is part of the PDFlib.com family of products.

PDFlib.com is Thomas Merz's company. In case you don't recognize his name: Thomas Merz is the author of the "PostScript and PDF Bible".

TET's first incarnation is a library. That one can probably do everything Budda006 wanted, including positional information about every element on the page. Oh, and it can also extract images. It recombines images which are fragmented into pieces.

PDFlib.com also offers another incarnation of this technology, the TET plugin for Acrobat. The third incarnation is the PDFlib TET iFilter, a standalone tool for user desktops. Both of these are free (as in beer) to use for private, non-commercial purposes.

And it's really powerful. Way better than Adobe's own text extraction. It extracted text for me where other tools (including Adobe's) produce only garbage.

I just tested the desktop standalone tool, and what they say on their webpage is true. It has a very good command-line interface. It handled some of my "problematic" PDF test files to my full satisfaction.

From now on, this will be my recommendation for any sophisticated or challenging PDF text extraction requirement.

TET is simply awesome. It detects tables. Inside tables, it identifies cells spanning multiple columns. It identifies table rows and contents of each table cell separately. It deals very well with hyphenations: it removes hyphens and restores complete words. It supports non-ASCII languages (including CJK, Arabic and Hebrew). When encountering ligatures, it restores the original characters...

Give it a try.

For Python, there are PDFMiner and PyPDF2. For more information on these, see "Python module for converting PDF to text".
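As a rough illustration of the PDFMiner route, the sketch below uses the `extract_text` helper from `pdfminer.high_level`, which is provided by the maintained pdfminer.six fork (installable with `pip install pdfminer.six`). The function names are made up here, and splitting on form feeds relies on the assumption that the extractor ends each page with a form-feed character, so check that against your own output.

```python
def extract_pdf_text(pdf_path):
    """Extract all text from a PDF using pdfminer.six."""
    # Imported lazily so this module can be loaded without pdfminer installed.
    from pdfminer.high_level import extract_text

    return extract_text(pdf_path)


def split_pages(text):
    """Split extracted text into pages.

    Assumption: each page ends with a form feed ("\\f"). Pages that contain
    only whitespace are dropped.
    """
    return [page for page in text.split("\f") if page.strip()]
```

With that in place, `pages = split_pages(extract_pdf_text("input.pdf"))` would give one string per non-empty page.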

Here is my suggestion. If you want to extract text from a PDF, you could import the PDF file into Google Docs and then export it to a friendlier format such as .html, .odf, .rtf, .txt, etc. All of this can be done through the Drive API. It is free* and robust. Take a look at:

https://developers.google.com/drive/v2/reference/files/insert
https://developers.google.com/drive/v2/reference/files/get

Because it is a REST API, it can be used from any programming language that can make HTTP requests. The links I posted above have working examples for many languages, including Java, .NET, Python, PHP, Ruby, and others.

I hope it helps.

PdfTextStream (which you said you have been looking at) is now free for single-threaded applications. In my opinion, its quality is much better than that of other libraries (especially for things like funky embedded fonts).

It is available in Java and C#.

Alternatively, have a look at Apache PDFBox, which is open source.



Tags: text, pdf, text-extraction, extraction, ghostscript
