I need to index a whole lot of webpages, what good webcrawler utilities are there? I'm preferably after something that .NET can talk to, but that's not a showstopper.
What I really need is something that I can give a site url to & it will follow every link and store the content for indexing.
Famous search engines such as Google, Yahoo and Bing do web crawling and use this information for indexing web pages.
HTTrack -- http://www.httrack.com/ -- is a very good Website copier. Works pretty good. Have been using it for a long time.
Nutch is a web crawler(crawler is the type of program you're looking for) -- http://lucene.apache.org/nutch/ -- which uses a top notch search utility lucene.
Crawler4j is an open source Java crawler which provides a simple interface for crawling the Web. You can setup a multi-threaded web crawler in 5 minutes.
You can set your own filter to visit pages or not (urls) and define some operation for each crawled page according to your logic.
Some reasons to select crawler4j;
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With