Data Scraping from PDF and Excel [closed]

I am doing a little data scraping. There are three types of file from which I am scraping data:

1- HTML
2- PDF
3- Excel(xls)

For HTML I am comfortable; I am using Html Agility Pack for that.

For PDF and Excel I would welcome suggestions.

Thanks in advance.

Sakhawat asked Jun 30 '10 09:06



2 Answers

For Excel: in a Microsoft environment you can either use Office Automation or read the file through OLE DB. In a Java environment, look at Apache POI.

EDIT: For PDF in Java, try Apache PDFBox. It can also be used from .NET via IKVM.

renick answered Nov 08 '22 03:11



I can recommend Cogniview's PDF2XL, a reasonably inexpensive commercial product, for extracting data from tables in PDF files into Excel. We have used it with great success.

Govert answered Nov 08 '22 02:11
