Crawler4j vs. Jsoup for crawling and parsing pages in Java

Tags: java, html-parsing, web-crawler, jsoup, crawler4j

I want to get the content of a page and extract specific parts of it. As far as I know, there are at least two solutions for such a task: Crawler4j and Jsoup.

Both of them are capable of retrieving the content of a page and extracting sub-parts of it. The only thing I'm not sure about is the difference between them. There is a similar question, which is marked as answered:

"Crawler4j is a crawler, Jsoup is a parser."

But I just checked: Jsoup can also crawl a page in addition to parsing it, while Crawler4j can not only crawl a page but also parse its content.

So, can you please clarify the difference between Crawler4j and Jsoup?

933

asked Jan 19 '16 22:01

Mike

1 Answer

Crawling is something bigger than just retrieving the contents of a single URI. If you just want to retrieve the content of some pages, there is no real benefit in using something like Crawler4J.

Let's take a look at an example. Assume you want to crawl a website. The requirements would be:

1. Start from a base URI (the home page).
2. Take all the URIs from each page and retrieve the contents of those too.
3. Proceed recursively for every URI you retrieve.
4. Retrieve the contents only of URIs that are inside this website (there could be external URIs referencing another website; we don't need those).
5. Avoid circular crawling. Page A has a URI for page B (of the same site). Page B has a URI for page A, but we already retrieved the content of page A (the About page has a link to the Home page, but we already got the contents of the Home page, so don't visit it again).
6. The crawling operation must be multithreaded.
7. The website is vast. It contains a lot of pages. We only want to retrieve 50 URIs, beginning from the Home page.
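To give a rough idea of the bookkeeping these requirements imply, here is a minimal sketch that uses only the Java standard library. It is not the API of Crawler4J or Jsoup. The names `CrawlSketch` and `crawl`, and the `example.com` link map, are hypothetical; the in-memory map stands in for real HTTP fetching and link extraction; the seed is assumed to be an absolute URI; and multithreading is left out entirely.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class CrawlSketch {

    /**
     * Breadth-first crawl over an in-memory link graph (a stand-in for
     * fetching pages over HTTP). Returns the URIs in the order visited.
     */
    public static List<String> crawl(Map<String, List<String>> linksByPage,
                                     String seed, int maxPages) {
        String siteHost = URI.create(seed).getHost(); // seed must be absolute
        Set<String> seen = new HashSet<>();           // URIs already queued
        Deque<String> frontier = new ArrayDeque<>();  // URIs waiting to be visited
        List<String> visited = new ArrayList<>();

        seen.add(seed);
        frontier.add(seed);

        // Stop when nothing is left to visit or the page limit is reached.
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String page = frontier.poll();
            visited.add(page); // a real crawler would fetch and parse here

            for (String link : linksByPage.getOrDefault(page, List.of())) {
                // Keep only links that stay on the seed's host.
                boolean sameSite = siteHost.equals(URI.create(link).getHost());
                // Set.add returns false for a URI seen before, which
                // prevents circular crawling.
                if (sameSite && seen.add(link)) {
                    frontier.add(link);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<String, List<String>> site = Map.of(
            "https://example.com/",
                List.of("https://example.com/about", "https://other.org/"),
            "https://example.com/about",
                List.of("https://example.com/", "https://example.com/contact"),
            "https://example.com/contact", List.of());

        // Visits Home, About, Contact; skips other.org and the link back to Home.
        System.out.println(crawl(site, "https://example.com/", 50));
    }
}
```

Even this toy version needs a frontier queue, a visited set, a same-site check, and a page limit, and it still ignores threads, politeness delays, redirects, and error handling.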

This is a simple scenario. Try solving it with Jsoup: you would have to implement all of this functionality yourself. Crawler4J, or any crawler micro-framework for that matter, would or should have an implementation for the actions above. Jsoup's strengths show when you get to decide what to do with the content.

Let's take a look at some requirements for parsing.

1. Get all paragraphs of a page.
2. Get all images.
3. Remove invalid tags (tags that do not comply with the HTML specs).
4. Remove script tags.

This is where Jsoup comes into play. Of course, there is some overlap here. Some things might be possible with either Crawler4J or Jsoup, but that doesn't make them equivalent. You could remove the content-retrieval mechanism from Jsoup and it would still be an amazing tool to use. If you removed retrieval from Crawler4J, it would lose half of its functionality.

I used both of them in the same project in a real-life scenario. I crawled a site, leveraging the strong points of Crawler4J for all the problems mentioned in the first example. Then I passed the content of each page I retrieved to Jsoup in order to extract the information I needed. Could I have done without one or the other? Yes, but I would have had to implement all the missing functionality myself.

Hence the difference: Crawler4J is a crawler with some simple operations for parsing (you could extract the images in one line), but it has no implementation for complex CSS queries. Jsoup is a parser that also gives you a simple API for HTTP requests, but it has no implementation for anything more complex than that.

125

answered Sep 30 '22 11:09

Alkis Kalogeris
