
How to extract data from a website using Java?

I am familiar with the Java programming language. I would like to extract data from a website and store it in a database running on my machine. Is that possible in Java? If so, which API should I use? For example, there are a number of schools listed on a website. How can I extract that data and store it in my database using Java?

giri asked Jan 11 '10 18:01



2 Answers

What you're referring to is commonly called 'screen scraping'. There are a variety of ways to do this in Java; I prefer HtmlUnit. Although it was designed as a way to test web functionality, you can use it to fetch a remote web page and parse it.

I would recommend using a good error-tolerant HTML parser like TagSoup to extract exactly what you're looking for from the HTML.
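To make the parsing step concrete, here is a minimal sketch of pulling list items out of a page. Note that it does not use HtmlUnit or TagSoup: to stay dependency-free it swaps in the lenient HTML parser that ships with the JDK (`javax.swing.text.html.parser.ParserDelegator`). The class name `SchoolScraper`, the method `extractListItems`, the assumption that each school sits in its own `<li>` element, and the sample school names are all made up for illustration; a real page will need its own selection logic.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class SchoolScraper {

    /** Collects the trimmed text of every <li> element found in the given HTML. */
    static List<String> extractListItems(String html) throws IOException {
        List<String> items = new ArrayList<>();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            // Non-null only while the parser is inside an <li> element.
            private StringBuilder current;

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.LI) {
                    current = new StringBuilder();
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (current != null) {
                    current.append(data);
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (tag == HTML.Tag.LI && current != null) {
                    items.add(current.toString().trim());
                    current = null;
                }
            }
        };

        // The third argument tells the parser to ignore charset declarations in the markup.
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return items;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical markup standing in for the downloaded page.
        String html = "<html><body><ul>"
                + "<li>Lincoln High School</li>"
                + "<li>Oak Elementary</li>"
                + "</ul></body></html>";
        for (String name : extractListItems(html)) {
            System.out.println(name);
        }
    }
}
```

In a real scraper the HTML string would come from an HTTP request rather than a literal, and each extracted name could then be written to the local database through JDBC with a `PreparedStatement`.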

lucas answered Sep 27 '22 21:09



You definitely need a good parser like NekoHTML.

Here's an example of using NekoHTML, albeit in Groovy (a scripting language that runs on the JVM) rather than Java itself:

http://www.keplarllp.com/blog/2010/01/better-competitive-intelligence-through-scraping-with-groovy

Alex Dean answered Sep 27 '22 23:09
