
What is the best Open Source Web Crawler Tool written in Java? [closed]

What is the best open source web crawler tool written in Java?

cuneytykaya asked Dec 12 '11 12:12


People also ask

Which web crawler is best?

Apache Nutch is frequently cited among the leading open source web crawlers. It is an established open source project for web data extraction and data mining, designed to be extensible and scalable.

What is an open source crawler?

Web crawlers are software that automatically visits websites and retrieves their data in a machine-readable format. Because the source is open, users can modify the code and customize the crawler to meet their own goals.

What is a Java web crawler?

A web crawler is a program that navigates the web to find new or updated pages for indexing. It starts from a set of seed URLs, often popular sites, extracts the hyperlinks on each page, and follows them depth-first or breadth-first.
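The seed-and-follow loop described above can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the "web" is an in-memory map from URL to HTML (no network access), the class and method names (`BfsCrawlSketch`, `extractLinks`, `crawl`) are made up for this example, and links are pulled out with a naive regular expression rather than a real HTML parser.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BfsCrawlSketch {
    // Naive href matcher (assumption: double-quoted absolute URLs only).
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]+)\"");

    // Returns every href value found in the given HTML, in document order.
    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    // Breadth-first traversal from the seeds; returns URLs in visit order.
    // "web" stands in for real HTTP fetches: URL -> page HTML.
    public static List<String> crawl(List<String> seeds, Map<String, String> web, int maxPages) {
        Set<String> seen = new LinkedHashSet<>(seeds);   // URLs already queued
        Queue<String> frontier = new ArrayDeque<>(seen); // URLs waiting to be fetched
        List<String> visited = new ArrayList<>();
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.remove();
            String html = web.get(url);
            if (html == null) {
                continue; // page not available in the fake web
            }
            visited.add(url);
            for (String link : extractLinks(html)) {
                if (seen.add(link)) { // add() is false for URLs seen before
                    frontier.add(link);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<String, String> web = Map.of(
            "http://a.example/", "<a href=\"http://b.example/\">b</a> <a href=\"http://c.example/\">c</a>",
            "http://b.example/", "<a href=\"http://a.example/\">a</a>",
            "http://c.example/", "<a href=\"http://d.example/\">d</a>");
        System.out.println(crawl(List.of("http://a.example/"), web, 10));
    }
}
```

Swapping the FIFO queue for a stack would turn the same loop into a depth-first crawl; the `seen` set is what keeps the crawler from looping on pages that link back to each other.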

Is Google a web crawler or web scraper?

Search engines such as Google, Yahoo, and Bing crawl the web and use the information they collect to index pages.


2 Answers

Try crawler4j. You just need to implement a simple interface which controls which URLs to visit and what to do with each crawled page.
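To illustrate the shape of that idea, here is a minimal, self-contained sketch of a callback-style crawler. It does not use crawler4j and does not reproduce its actual types or signatures: the `PageHandler` interface, the `run` method, and the in-memory link graph are all hypothetical names invented for this example. The point is only to show one hook that decides which URLs to follow and another that handles each crawled page.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class CallbackCrawlSketch {
    /** Hypothetical callback contract; not crawler4j's actual API. */
    public interface PageHandler {
        // Decide whether a discovered URL should be followed.
        boolean shouldVisit(String url);

        // Process a crawled page; here a "page" is just its outgoing links.
        void visit(String url, List<String> outgoingLinks);
    }

    /** Walks an in-memory link graph, consulting the handler at each step. */
    public static void run(String seed, Map<String, List<String>> linkGraph, PageHandler handler) {
        Set<String> seen = new HashSet<>();
        Deque<String> frontier = new ArrayDeque<>();
        seen.add(seed);
        frontier.add(seed);
        while (!frontier.isEmpty()) {
            String url = frontier.remove();
            List<String> links = linkGraph.getOrDefault(url, List.of());
            handler.visit(url, links);
            for (String link : links) {
                // Follow a link only if the handler accepts it and it is new.
                if (handler.shouldVisit(link) && seen.add(link)) {
                    frontier.add(link);
                }
            }
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> graph = Map.of(
            "http://site.example/", List.of("http://site.example/docs", "http://other.example/"),
            "http://site.example/docs", List.of("http://site.example/"));
        List<String> visited = new ArrayList<>();
        run("http://site.example/", graph, new PageHandler() {
            @Override
            public boolean shouldVisit(String url) {
                return url.startsWith("http://site.example/"); // stay on one host
            }

            @Override
            public void visit(String url, List<String> outgoingLinks) {
                visited.add(url);
            }
        });
        System.out.println(visited);
    }
}
```

Keeping the filtering and processing logic behind two small methods is what lets the crawl loop itself stay generic; check the crawler4j documentation for the real class names and method signatures before adapting this.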

Andy answered Nov 09 '22 03:11


In Java, I think it boils down to Nutch vs. Heritrix. You should specify what your needs are to get a better answer.

riffraff answered Nov 09 '22 05:11