
Crawler in Groovy (JSoup VS Crawler4j)

I want to develop a web crawler in Groovy (using the Grails framework and a MongoDB database) that can crawl a website and build a list of its URLs along with each one's resource type, content, response time, and the number of redirects involved.
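Regardless of which library does the crawling, the requirements above boil down to collecting one record per URL. A minimal sketch in plain Java (which Groovy accepts directly; the class and helper names here are my own, not from Grails or any crawler library) of what such a record might look like, with a small helper that maps a Content-Type header to a coarse resource type:

```java
// Illustrative per-URL crawl record; field and class names are assumptions,
// not part of Grails, Jsoup, or Crawler4j.
public class CrawlRecord {
    public final String url;
    public final String resourceType;   // e.g. "html", "image", "css"
    public final long responseTimeMs;
    public final int redirectCount;

    public CrawlRecord(String url, String resourceType,
                       long responseTimeMs, int redirectCount) {
        this.url = url;
        this.resourceType = resourceType;
        this.responseTimeMs = responseTimeMs;
        this.redirectCount = redirectCount;
    }

    // Map a Content-Type header (e.g. "text/html; charset=utf-8")
    // to a coarse resource type for the crawl report.
    public static String resourceType(String contentType) {
        if (contentType == null) return "unknown";
        String mime = contentType.split(";")[0].trim().toLowerCase();
        if (mime.equals("text/html")) return "html";
        if (mime.startsWith("image/")) return "image";
        if (mime.equals("text/css")) return "css";
        if (mime.contains("javascript")) return "script";
        return mime;
    }
}
```

In Grails this would naturally become a domain class persisted to MongoDB; the sketch only shows the shape of the data.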

I am debating between Jsoup and Crawler4j. I have read about what each basically does, but I cannot clearly see the difference between the two. Can anyone suggest which would be better suited to the functionality above? Or is it incorrect to compare the two at all?

Thanks.

clever_bassi asked Jun 23 '14 17:06



1 Answer

Crawler4j is a crawler; Jsoup is a parser. Actually, you could (and should) use both. Crawler4j provides an easy multithreaded interface for fetching all the URLs and all the pages (content) of the site you want. After that you can use Jsoup to parse the data with its excellent (jQuery-like) CSS selectors and actually do something with it. Of course you have to consider dynamic (JavaScript-generated) content: if you want that content too, you have to use something that includes a JavaScript engine (a headless browser plus a parser), such as HtmlUnit or WebDriver (Selenium), which will execute the JavaScript before you parse the content.
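To make the division of labor concrete, here is a dependency-free sketch in plain Java. Crawler4j and Jsoup are external jars, so a simple regex stand-in does the link extraction here; in real code, Crawler4j's `visit(Page)` callback would hand you the page HTML (via `HtmlParseData`), and Jsoup's `Jsoup.parse(html).select("a[href]")` would replace the regex:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Dependency-free stand-in for the parse step of a crawl pipeline.
// With the real libraries: Crawler4j fetches the page and calls visit(Page),
// and Jsoup's Jsoup.parse(html).select("a[href]") replaces the regex below.
public class LinkExtractor {

    // Naive href matcher, for illustration only; Jsoup's CSS selectors
    // handle real-world HTML far more robustly than a regex can.
    private static final Pattern HREF =
        Pattern.compile("<a\\s[^>]*href=\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));   // capture group 1 is the href value
        }
        return links;
    }
}
```

The point is the pipeline shape: the crawler discovers and downloads pages, the parser extracts structured data from each downloaded page, and the extracted links can be fed back to the crawler's frontier.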

Alkis Kalogeris answered Sep 21 '22 15:09