I want to develop a web crawler in Groovy (using the Grails framework and a MongoDB database) that can crawl a website, building a list of the site's URLs along with their resource types, their content, the response times, and the number of redirects involved.
I am debating between Jsoup and Crawler4j. I have read about what each basically does, but I cannot clearly understand the difference between the two. Can anyone suggest which would be better for the above functionality? Or is it totally incorrect to compare the two?
Thanks.
A web crawler is a program that navigates the Web and finds new or updated pages for indexing. The crawler starts from a set of seed URLs; the queue of URLs still to be visited is known as the frontier. It traverses pages breadth-first or depth-first, extracting hyperlinks from each page and adding them to the frontier.
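To make the frontier idea concrete, here is a minimal breadth-first sketch in Groovy using only Jsoup. The seed URL, page cap, and Jsoup version are illustrative, not something from your setup:

```groovy
@Grab('org.jsoup:jsoup:1.15.3')
import org.jsoup.Jsoup

// Frontier = queue of URLs still to visit, seeded with a starting URL.
def frontier = new ArrayDeque(['https://example.com'])  // illustrative seed
def visited = [] as Set
def maxPages = 20  // arbitrary cap so the sketch terminates

while (frontier && visited.size() < maxPages) {
    def url = frontier.poll()
    if (url in visited) continue
    visited << url
    try {
        def doc = Jsoup.connect(url).get()
        // 'abs:href' resolves each link against the page's base URL
        doc.select('a[href]').each { link ->
            def next = link.attr('abs:href')
            if (next && !(next in visited)) frontier << next
        }
        println "Visited ${url} -> ${doc.title()}"
    } catch (Exception e) {
        println "Skipped ${url}: ${e.message}"
    }
}
```

Note that this toy loop is single-threaded and ignores robots.txt, politeness delays, and duplicate-content detection; a real crawler library handles those for you.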
Crawler4j is a crawler; Jsoup is a parser. Actually, you could (and probably should) use both. Crawler4j gives you an easy multithreaded interface for fetching all the URLs and all the pages (content) of the site you want. You can then use Jsoup to parse that content, with its excellent (jQuery-like) CSS selectors, and actually do something with the data.

Of course, you have to consider dynamic (JavaScript-generated) content. Neither tool executes JavaScript, so if you want that content too, you have to use something that includes a JavaScript engine (a headless browser plus a parser), such as HtmlUnit or WebDriver (Selenium), which will execute the JavaScript before you parse the content.
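Here is a hedged sketch of how the two fit together in Groovy: Crawler4j drives the crawl and hands each fetched page to Jsoup for parsing. The seed URL, domain filter, storage folder, thread count, and dependency versions are all assumptions for illustration; response times and redirect counts from your question would need extra bookkeeping (e.g., timing around the fetch) that this sketch leaves out:

```groovy
@Grab('edu.uci.ics:crawler4j:4.4.0')   // illustrative version
@Grab('org.jsoup:jsoup:1.15.3')        // illustrative version
import edu.uci.ics.crawler4j.crawler.*
import edu.uci.ics.crawler4j.fetcher.PageFetcher
import edu.uci.ics.crawler4j.parser.HtmlParseData
import edu.uci.ics.crawler4j.robotstxt.*
import edu.uci.ics.crawler4j.url.WebURL
import org.jsoup.Jsoup

class SiteCrawler extends WebCrawler {
    @Override
    boolean shouldVisit(Page referringPage, WebURL url) {
        // Stay on the seed domain; adjust to your own site
        url.getURL().startsWith('https://example.com')
    }

    @Override
    void visit(Page page) {
        def url = page.getWebURL().getURL()
        def contentType = page.getContentType()   // the "resource type" per URL
        if (page.getParseData() instanceof HtmlParseData) {
            def html = ((HtmlParseData) page.getParseData()).getHtml()
            // Hand the raw HTML to Jsoup for CSS-selector-based parsing
            def doc = Jsoup.parse(html, url)
            println "${url} [${contentType}] title=${doc.title()} links=${doc.select('a[href]').size()}"
            // ...persist url, contentType, doc.text(), etc. to MongoDB here
        }
    }
}

// Controller wiring: storage folder and seed are placeholders
def config = new CrawlConfig(crawlStorageFolder: '/tmp/crawl')
def fetcher = new PageFetcher(config)
def robots = new RobotstxtServer(new RobotstxtConfig(), fetcher)
def controller = new CrawlController(config, fetcher, robots)
controller.addSeed('https://example.com/')
controller.start(SiteCrawler, 2)   // 2 concurrent crawler threads
```

In a Grails app you would replace the `println` with a save to a GORM/MongoDB domain class, keeping the crawl (Crawler4j) and the extraction (Jsoup) as separate concerns.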