Java Web Crawler Libraries

I want to make a Java-based web crawler for an experiment. I've heard that making a web crawler in Java is the way to go if it's your first time. However, I have two important questions.

  1. How will my program 'visit' or 'connect' to web pages? Please give a brief explanation. (I understand the basics of the layers of abstraction from the hardware up to the software; here I am interested in the Java abstractions.)

  2. What libraries should I use? I would assume I need a library for connecting to web pages, a library for the HTTP/HTTPS protocol, and a library for HTML parsing.

asked Jul 01 '12 by CodeKingPlusPlus


People also ask

What is web crawler in Java?

A web crawler is a program that navigates the web to find new or updated pages for indexing. It begins with a set of seed websites or popular URLs and explores outward, depth-first or breadth-first, extracting hyperlinks as it goes.

Can Java be used for web scraping?

Yes. There are many powerful Java libraries used for web scraping. Two such examples are JSoup and HtmlUnit. These libraries help you connect to a web page and offer many methods to extract the desired information.
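For instance, here is a minimal jsoup sketch of fetching and querying a page (the URL is just a placeholder); jsoup handles the HTTP connection and exposes CSS-style selectors for extraction:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    public class JsoupDemo {
        public static void main(String[] args) throws Exception {
            // connect() + get() fetches the page and parses it in one step
            Document doc = Jsoup.connect("http://stackoverflow.com/").get();

            System.out.println("Title: " + doc.title());
            // CSS-style selectors pull out the elements you care about
            System.out.println("Links found: " + doc.select("a[href]").size());
        }
    }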

How do you implement a web crawler?

Here are the basic steps to build a crawler:

  1. Add one or several URLs to be visited.

  2. Pop a link from the URLs to be visited and add it to the visited-URLs list.

  3. Fetch the page's content and scrape the data you're interested in with the ScrapingBot API.

A minimal sketch of this loop is shown below.
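Since the ScrapingBot API is specific to that service, this sketch substitutes jsoup for the fetch-and-extract step; the seed URL and the 50-page cap are arbitrary placeholders:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashSet;
    import java.util.Set;

    public class SimpleCrawler {
        public static void main(String[] args) {
            Deque<String> toVisit = new ArrayDeque<>();   // step 1: seed the frontier
            Set<String> visited = new HashSet<>();
            toVisit.add("http://stackoverflow.com/");

            while (!toVisit.isEmpty() && visited.size() < 50) {
                String url = toVisit.poll();              // step 2: pop a link...
                if (!visited.add(url)) {
                    continue;                             // ...skipping URLs seen before
                }
                try {
                    Document doc = Jsoup.connect(url).get();  // step 3: fetch and parse
                    System.out.println(visited.size() + ": " + doc.title());

                    for (Element link : doc.select("a[href]")) {
                        String next = link.absUrl("href");    // resolve relative links
                        if (!next.isEmpty()) {
                            toVisit.add(next);
                        }
                    }
                } catch (Exception e) {
                    // skip pages that fail to download or parse
                }
            }
        }
    }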


2 Answers

Crawler4j is the best solution for you.

Crawler4j is an open source Java crawler that provides a simple interface for crawling the Web. You can set up a multi-threaded web crawler in 5 minutes!
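As a rough illustration, here is a minimal crawler following the pattern from crawler4j's documentation; exact method signatures vary a bit between versions, and the storage folder and seed URL are placeholders:

    import edu.uci.ics.crawler4j.crawler.CrawlConfig;
    import edu.uci.ics.crawler4j.crawler.CrawlController;
    import edu.uci.ics.crawler4j.crawler.Page;
    import edu.uci.ics.crawler4j.crawler.WebCrawler;
    import edu.uci.ics.crawler4j.fetcher.PageFetcher;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
    import edu.uci.ics.crawler4j.url.WebURL;

    public class MyCrawler extends WebCrawler {
        @Override
        public boolean shouldVisit(Page referringPage, WebURL url) {
            // restrict the crawl to a single site
            return url.getURL().startsWith("http://stackoverflow.com/");
        }

        @Override
        public void visit(Page page) {
            System.out.println("Visited: " + page.getWebURL().getURL());
        }

        public static void main(String[] args) throws Exception {
            CrawlConfig config = new CrawlConfig();
            config.setCrawlStorageFolder("/tmp/crawl");   // intermediate crawl state lives here
            PageFetcher fetcher = new PageFetcher(config);
            RobotstxtServer robots = new RobotstxtServer(new RobotstxtConfig(), fetcher);

            CrawlController controller = new CrawlController(config, fetcher, robots);
            controller.addSeed("http://stackoverflow.com/");
            controller.start(MyCrawler.class, 5);         // 5 crawler threads
        }
    }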

Also have a look around for more Java-based web crawler tools and a brief explanation of each.

answered Sep 29 '22 by cuneytykaya


This is how your program 'visits' or 'connects' to web pages:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.MalformedURLException;
    import java.net.URL;

    BufferedReader reader = null;
    String line;

    try {
        URL url = new URL("http://stackoverflow.com/");
        // openStream() performs the HTTP GET and returns the response body stream
        reader = new BufferedReader(new InputStreamReader(url.openStream()));

        while ((line = reader.readLine()) != null) {
            System.out.println(line);
        }
    } catch (MalformedURLException mue) {
        mue.printStackTrace();
    } catch (IOException ioe) {
        ioe.printStackTrace();
    } finally {
        if (reader != null) {
            try {
                reader.close();
            } catch (IOException ioe) {
                // nothing to see here
            }
        }
    }

This will download the HTML source of the page.

For HTML parsing, take a look at jsoup. Also check out jSpider for crawling.
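As a rough sketch of the parsing step, jsoup can take the raw HTML downloaded above and make it queryable (the inline html string here is just a stand-in for the downloaded source):

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class ParseExample {
        public static void main(String[] args) {
            // stand-in for the page source downloaded by the snippet above
            String html = "<html><body><a href='http://example.com'>Example</a></body></html>";

            Document doc = Jsoup.parse(html);
            for (Element link : doc.select("a[href]")) {
                System.out.println(link.attr("href") + " -> " + link.text());
            }
        }
    }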

answered Sep 29 '22 by Mohammad Adil