 

What's a good Web Crawler tool [closed]

I need to index a whole lot of web pages. What good web crawler utilities are there? I'd prefer something that .NET can talk to, but that's not a showstopper.

What I really need is something that I can give a site URL to, and it will follow every link and store the content for indexing.

asked Oct 07 '08 by Glenn Slaven


2 Answers

HTTrack -- http://www.httrack.com/ -- is a very good website copier. It works pretty well; I've been using it for a long time.

Nutch is a web crawler (a crawler is the type of program you're looking for) -- http://lucene.apache.org/nutch/ -- and it uses Lucene, a top-notch search library.

answered Sep 24 '22 by anjanb

Crawler4j is an open source Java crawler that provides a simple interface for crawling the web. You can set up a multi-threaded web crawler in five minutes.

You can define your own filter to decide which pages (URLs) to visit, and implement whatever operation you need for each crawled page according to your own logic (see the sketch after the list below).

Some reasons to choose crawler4j:

  1. Multi-threaded architecture
  2. You can set the crawl depth
  3. It is Java-based and open source
  4. It handles redundant links (URLs)
  5. You can set the number of pages to be crawled
  6. You can set the maximum page size to be crawled
  7. Adequate documentation
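
For illustration, here is a minimal sketch of that filter/visit structure with crawler4j. The WebCrawler, CrawlConfig and CrawlController classes come from the crawler4j library (the exact shouldVisit signature differs slightly between versions); the example.com seed URL, storage folder, depth, page limit and thread count are placeholder values you would replace with your own:

    import edu.uci.ics.crawler4j.crawler.CrawlConfig;
    import edu.uci.ics.crawler4j.crawler.CrawlController;
    import edu.uci.ics.crawler4j.crawler.Page;
    import edu.uci.ics.crawler4j.crawler.WebCrawler;
    import edu.uci.ics.crawler4j.fetcher.PageFetcher;
    import edu.uci.ics.crawler4j.parser.HtmlParseData;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
    import edu.uci.ics.crawler4j.url.WebURL;

    public class MyCrawler extends WebCrawler {

        @Override
        public boolean shouldVisit(Page referringPage, WebURL url) {
            // Your filter: only follow links that stay on the seed site
            return url.getURL().toLowerCase().startsWith("https://www.example.com/");
        }

        @Override
        public void visit(Page page) {
            // Your per-page operation: grab the extracted text for indexing
            if (page.getParseData() instanceof HtmlParseData) {
                HtmlParseData html = (HtmlParseData) page.getParseData();
                String text = html.getText();   // plain text, ready to hand to an indexer
                System.out.println("Crawled " + page.getWebURL().getURL()
                        + " (" + text.length() + " chars)");
            }
        }

        public static void main(String[] args) throws Exception {
            CrawlConfig config = new CrawlConfig();
            config.setCrawlStorageFolder("/tmp/crawl");   // intermediate crawl data
            config.setMaxDepthOfCrawling(3);              // reason 2: crawl depth
            config.setMaxPagesToFetch(1000);              // reason 5: page limit

            PageFetcher pageFetcher = new PageFetcher(config);
            RobotstxtServer robotstxtServer =
                    new RobotstxtServer(new RobotstxtConfig(), pageFetcher);
            CrawlController controller =
                    new CrawlController(config, pageFetcher, robotstxtServer);

            controller.addSeed("https://www.example.com/");
            controller.start(MyCrawler.class, 5);         // reason 1: 5 crawler threads
        }
    }

The visit callback is where you would store the content for indexing, e.g. by feeding the extracted text into Lucene (the library Nutch builds on, mentioned above).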
answered Sep 21 '22 by cuneytykaya