
What is the easiest way to extract data from a PDF?

Tags:

java

pdf

I need to extract data from some PDF documents (using Java). I need to know what would be the easiest way to do it.

I tried iText, but it is fairly complicated for my needs. Besides, as far as I know it is not free for commercial projects, so it is not an option. I also tried PDFBox and ran into various NoClassDefFoundError errors.

I googled and came across several other options, such as PDF Clown and jPod, but I do not have time to experiment with all of these libraries. I am relying on the community's experience with reading PDFs in Java.

Note that I do not need to create or manipulate PDF documents. I just need to extract textual data from PDF documents with moderately complex layouts.

Please suggest the quickest and easiest way to extract text from PDF documents. Thanks.

Sebastian Fork asked Jul 26 '11 14:07


1 Answer

I recommend trying Apache Tika. Apache Tika is basically a toolkit that extracts data from many types of documents, including PDFs.

Besides being free, Tika has the benefit that it used to be a subproject of Apache Lucene, which is a very robust open-source search engine. Tika includes a built-in PDF parser that uses a SAX content handler to pass PDF data to your application. It can also extract data from encrypted PDFs, and it lets you create a new parser or subclass an existing one to customize the behavior.
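To make the SAX content handler idea more concrete, here is a minimal sketch that uses only the JDK. It is not Tika code: the JDK's own SAX parser stands in for Tika's PDF parser, and a small XHTML string stands in for the events a parser would emit. The names TextCollector and extractText are hypothetical and exist only for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

public class TextCollectorDemo {

    /** Hypothetical handler: accumulates the character data it receives. */
    static class TextCollector extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            // SAX may deliver one text node in several chunks, so always append.
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            // End each paragraph with a newline so the plain text stays readable.
            if ("p".equals(qName)) {
                text.append('\n');
            }
        }

        @Override
        public String toString() {
            return text.toString();
        }
    }

    /** Stand-in for a parser run: pushes XHTML events into the handler. */
    static String extractText(String xhtml) throws Exception {
        TextCollector handler = new TextCollector();
        SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)),
                handler);
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body><p>Invoice 42</p><p>Total: 10 EUR</p></body></html>";
        System.out.print(extractText(xhtml));
    }
}
```

The point of the design is that the parser pushes events and the handler decides what to keep, so the same handler can be reused with any source that produces SAX events.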

The code is simple. If you need custom behavior, create a class that implements the Parser interface and define a parse() method (for plain text extraction, the built-in PDFParser used further down is enough):

public void parse(
   InputStream stream, ContentHandler handler,
   Metadata metadata, ParseContext context)
   throws IOException, SAXException, TikaException {

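   // HELLO_MIME_TYPE is a constant you define in your own parser class.
   metadata.set(Metadata.CONTENT_TYPE, HELLO_MIME_TYPE);
   metadata.set("Hello", "World");

   // Emit an (empty) XHTML document to the caller's content handler.
   XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
   xhtml.startDocument();
   xhtml.endDocument();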
}

Then, to run the parser, you could do something like this:

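// Classes come from org.apache.tika.* (tika-core, tika-parsers) and org.xml.sax.
try (InputStream input = new FileInputStream(new File(resourceLocation))) {
    ContentHandler textHandler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    PDFParser parser = new PDFParser();
    // Recent Tika releases expect the four-argument parse() with a ParseContext.
    parser.parse(input, textHandler, metadata, new ParseContext());
    // Metadata key names can differ between Tika versions.
    System.out.println("Title: " + metadata.get("title"));
    System.out.println("Author: " + metadata.get("Author"));
    System.out.println("content: " + textHandler.toString());
}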
Kyle answered Nov 15 '22 16:11