I want to make a Java-based web crawler for an experiment. I've heard that writing a web crawler in Java is the way to go if it's your first time. However, I have two important questions.
How will my program 'visit' or 'connect' to web pages? Please give a brief explanation. (I understand the basics of the layers of abstraction from the hardware up to the software; here I am interested in the Java abstractions.)
What libraries should I use? I assume I need a library for connecting to web pages, one for the HTTP/HTTPS protocol, and one for HTML parsing.
A web crawler is basically a program that navigates the web and finds new or updated pages for indexing. The crawler begins with a set of seed URLs (often popular websites) and follows hyperlinks depth-first or breadth-first to discover new pages.
Yes, there are many powerful Java libraries used for web scraping. Two such examples are jsoup and HtmlUnit. These libraries help you connect to a web page and offer many methods to extract the desired information.
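For example, fetching a page and reading its title takes only a few lines with jsoup. A minimal sketch (the URL is just a placeholder):

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupFetch {
    public static void main(String[] args) throws IOException {
        // connect() opens an HTTP(S) connection; get() downloads and parses the page
        Document doc = Jsoup.connect("https://stackoverflow.com/").get();
        System.out.println("Title: " + doc.title());
    }
}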
Here are the basic steps to build a crawler (see the sketch after this list):
Step 1: Add one or several URLs to the list of URLs to be visited.
Step 2: Pop a link from the URLs to be visited and add it to the list of visited URLs.
Step 3: Fetch the page's content and scrape the data you're interested in with the ScrapingBot API.
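Put together, those steps are just a queue of URLs to visit plus a set of visited URLs. Below is a minimal breadth-first sketch; it uses jsoup for the fetching step instead of the ScrapingBot API (any fetcher slots in there), and the seed URL and page limit are arbitrary choices:

import java.io.IOException;
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SimpleCrawler {
    public static void main(String[] args) {
        Queue<String> toVisit = new ArrayDeque<>();   // Step 1: URLs to be visited
        Set<String> visited = new HashSet<>();        // Step 2: visited URLs
        toVisit.add("https://stackoverflow.com/");

        while (!toVisit.isEmpty() && visited.size() < 50) {
            String url = toVisit.poll();              // Step 2: pop a link...
            if (!visited.add(url)) continue;          // ...and record it as visited
            try {
                Document doc = Jsoup.connect(url).get();   // Step 3: fetch the page
                System.out.println(url + " -> " + doc.title());
                for (Element link : doc.select("a[href]")) {
                    toVisit.add(link.attr("abs:href"));    // enqueue outgoing links
                }
            } catch (IOException e) {
                System.err.println("Failed to fetch " + url);
            }
        }
    }
}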
Crawler4j is the best solution for you.
Crawler4j is an open-source Java crawler which provides a simple interface for crawling the web. You can set up a multi-threaded web crawler in five minutes!
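Here is a sketch of that setup, adapted from crawler4j's README (the package names assume the edu.uci.ics.crawler4j artifact; the storage folder, seed URL, domain filter, and thread count are placeholders):

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {
    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        // Only follow links that stay on the seed domain
        return url.getURL().toLowerCase().startsWith("https://stackoverflow.com/");
    }

    @Override
    public void visit(Page page) {
        System.out.println("Visited: " + page.getWebURL().getURL());
    }

    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawler4j/"); // intermediate crawl data
        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtServer robotstxtServer = new RobotstxtServer(new RobotstxtConfig(), pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
        controller.addSeed("https://stackoverflow.com/");
        controller.start(MyCrawler.class, 4); // 4 concurrent crawler threads
    }
}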
There are also other Java-based web crawler tools worth a look, each with a brief explanation of what it does.
This is how your program can 'visit' or 'connect' to web pages:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;

try {
    URL url = new URL("http://stackoverflow.com/");
    // openStream() issues the HTTP GET and returns the response body as a stream
    try (BufferedReader reader = new BufferedReader(
            new InputStreamReader(url.openStream()))) {
        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println(line);
        }
    } // try-with-resources closes the stream even if reading fails
} catch (MalformedURLException mue) {
    mue.printStackTrace();
} catch (IOException ioe) {
    ioe.printStackTrace();
}
This will download the HTML source of the page.
For HTML parsing, use a dedicated parser (sketched below).
Also take a look at jSpider and jsoup.
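For instance, once you have accumulated the downloaded lines into a string, jsoup can parse it and pull out the links. A small sketch (the HTML literal is a stand-in for whatever you downloaded):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ParseExample {
    public static void main(String[] args) {
        // In the snippet above, you would collect the printed lines into this string
        String html = "<html><body><a href='/questions'>Questions</a></body></html>";
        // The base URL lets jsoup resolve relative links to absolute ones
        Document doc = Jsoup.parse(html, "http://stackoverflow.com/");
        for (Element link : doc.select("a[href]")) {
            System.out.println(link.attr("abs:href")); // prints absolute URLs
        }
    }
}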