I am doing a little data scraping. There are three types of file from which I am scraping data:
1- HTML
2- PDF
3- Excel(xls)
For HTML I am comfortable; I am using the HTML Agility Pack for that.
For PDF and Excel I need suggestions.
Thanks in advance.
Concerning Excel: if you are in a Microsoft environment, you can use either Office Automation or OLE DB. In a Java environment, look at Apache POI.
EDIT: Concerning PDF, in Java try Apache PDFBox. It can also work in .NET using IKVM.
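To illustrate the Apache POI suggestion, here is a minimal sketch of reading every cell of the first sheet of an .xls file as text. It assumes Apache POI is on the classpath. The class name `XlsReader` and the method `readFirstSheet` are made up for this example. It uses POI's HSSF classes, which handle the older binary .xls format.

```java
import org.apache.poi.hssf.usermodel.HSSFWorkbook;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.DataFormatter;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;

import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of Apache POI.
public class XlsReader {

    // Reads the first sheet of a binary .xls workbook and returns
    // each row as a list of cell values formatted as strings.
    public static List<List<String>> readFirstSheet(InputStream in) throws IOException {
        List<List<String>> rows = new ArrayList<>();
        // DataFormatter renders a cell roughly the way Excel would display it.
        DataFormatter formatter = new DataFormatter();
        try (Workbook workbook = new HSSFWorkbook(in)) {
            Sheet sheet = workbook.getSheetAt(0);
            // Iteration covers only rows and cells that exist in the file;
            // gaps left by empty cells are skipped, so column positions may shift.
            for (Row row : sheet) {
                List<String> cells = new ArrayList<>();
                for (Cell cell : row) {
                    cells.add(formatter.formatCellValue(cell));
                }
                rows.add(cells);
            }
        }
        return rows;
    }
}
```

To use it, pass a `FileInputStream` for the .xls file to `XlsReader.readFirstSheet`. If column alignment matters, index cells by `cell.getColumnIndex()` instead of relying on list position.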
I can recommend Cogniview's PDF2XL, a reasonably inexpensive commercial product, to extract data from tables in PDF files into Excel. We have used it with great success.