 

How to extract data from a PDF?

My company receives data from an external company via Excel. We import this into SQL Server to run reports on the data. They are now changing to PDF format. Is there a way to reliably extract the data from the PDF and insert it into our SQL Server 2008 database?

Would this require writing an app or is there an automated way of doing this?

Fermin asked Jul 07 '09 11:07




2 Answers

As already mentioned - you will have to write an app to do this, but ideally you would be able to get the raw data from the external company rather than having to process the PDF.

However, if you do want to extract the data from the PDF, I've used iText and found it to be very powerful, reliable and, most importantly, free. It comes in Java and .NET flavours; iTextSharp is the .NET version. It lets you programmatically manipulate PDF documents and exposes the contents of the PDF to the application that you write.

Chris B answered Sep 23 '22 15:09


It all depends on how they've included the data within the PDF. Generally speaking, there are two possible scenarios here:

  1. The data is just a text object within a PDF. You'll need to use a tool to extract the text from the PDF then insert it into your database.

  2. The data is contained within form fields in a PDF. You'll need to use a tool to extract data from the form fields and insert it into your database.

Hopefully scenario #2 applies to you because this is precisely what PDF forms are designed for. Scenario #1 is really just a hack that you'd only use if you didn't have any other options. Extracting plain text from a PDF isn't as easy or accurate as you might expect.
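To give a feel for why scenario #1 is fragile, here is a minimal Python sketch of the step that comes after text extraction. It assumes some PDF library has already handed you the page text as a string. The line layout (invoice id, date, amount), the pattern and the function name parse_extracted_text are all made up for illustration; your real PDF will almost certainly need a different pattern.

```python
import re
from decimal import Decimal

# Hypothetical row layout: "<invoice id>  <YYYY-MM-DD>  <amount>".
# This is an assumption for the sketch, not the layout of any real report.
ROW_PATTERN = re.compile(
    r"^(?P<invoice_id>[A-Z]+-\d+)\s+"
    r"(?P<date>\d{4}-\d{2}-\d{2})\s+"
    r"(?P<amount>-?[\d,]+\.\d{2})$"
)


def parse_extracted_text(text):
    """Split already-extracted PDF text into (rows, rejected_lines)."""
    rows, rejected = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = ROW_PATTERN.match(line)
        if match is None:
            # Headers, footers and anything unexpected land here for review
            # instead of being silently dropped.
            rejected.append(line)
            continue
        rows.append({
            "invoice_id": match["invoice_id"],
            "date": match["date"],
            "amount": Decimal(match["amount"].replace(",", "")),
        })
    return rows, rejected


# Sample text standing in for whatever the PDF library returns.
sample = """Invoice    Date        Amount
INV-1001   2009-07-01  1,250.00
INV-1002   2009-07-02  99.50
Page 1 of 1
"""
rows, rejected = parse_extracted_text(sample)
```

Keeping the rejected lines is deliberate: if the sender changes the layout, you want the import to show you what it could not parse rather than load partial data.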

If you're receiving a PDF form then all you need to do is match up the right fields in the PDF form with the corresponding fields in your database and then suck in the data. This process could be entirely automated if you wrote your own application.
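As a rough sketch of that matching step, the Python below takes a dictionary of form field values (assumed to have been read from the PDF by whatever library you choose) and inserts it as one row. The field names, the column names, the orders table and the function insert_form_data are hypothetical, and sqlite3 is used only so the example runs on its own; against SQL Server you would swap in your own driver and connection.

```python
import sqlite3

# Hypothetical mapping from PDF form field names to database column names.
FIELD_TO_COLUMN = {
    "CustomerName": "customer_name",
    "OrderDate": "order_date",
    "TotalAmount": "total_amount",
}


def insert_form_data(conn, form_fields):
    """Insert one row built from a dict of PDF form field values."""
    missing = [name for name in FIELD_TO_COLUMN if name not in form_fields]
    if missing:
        raise ValueError(f"PDF form is missing fields: {missing}")

    columns = list(FIELD_TO_COLUMN.values())
    values = [form_fields[name] for name in FIELD_TO_COLUMN]
    placeholders = ", ".join("?" for _ in columns)
    # Column names come from the fixed mapping above, never from the PDF,
    # so only the values need to be passed as parameters.
    sql = f"INSERT INTO orders ({', '.join(columns)}) VALUES ({placeholders})"
    conn.execute(sql, values)
    conn.commit()


# In-memory SQLite database standing in for the real reporting database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (customer_name TEXT, order_date TEXT, total_amount REAL)"
)
insert_form_data(
    conn,
    {"CustomerName": "Acme Ltd", "OrderDate": "2009-07-07", "TotalAmount": 1250.0},
)
```

Failing loudly on a missing field is a design choice for the sketch: a form that has lost a field usually means the sender changed the template, which is worth knowing before the data reaches your reports.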

"Would this require writing an app or is there an automated way of doing this?"

Yes, both of these options would require writing an app or buying an app. If you write your own app then you'll need to find a third-party PDF library that supports retrieving data from form fields or extracting text from a PDF.

Rowan answered Sep 21 '22 15:09