What's a good Web Crawler tool [closed]

Tags:

web-crawler

robot

I need to index a whole lot of web pages. What good web crawler utilities are there? I'd prefer something that .NET can talk to, but that's not a showstopper.

What I really need is something that I can give a site URL to, and it will follow every link and store the content for indexing.

664

asked Oct 07 '08 00:10

Glenn Slaven

2 Answers

HTTrack -- http://www.httrack.com/ -- is a very good website copier. It works pretty well, and I have been using it for a long time.

Nutch is a web crawler (a crawler is the type of program you're looking for) -- http://lucene.apache.org/nutch/ -- which uses the top-notch search utility Lucene.
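To make "crawler" a bit more concrete: the core step in any such tool is pulling the links out of a fetched page so they can be followed next. Below is a minimal sketch of that one step using only the Python standard library. It is not how HTTrack or Nutch are implemented, and the names `LinkExtractor` and `extract_links` are made up for this illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects absolute link targets from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL and drop #fragments
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                self.links.append(absolute)


def extract_links(base_url, html):
    """Return the links found in `html`, in document order (hypothetical helper)."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```

A real crawler would also need to fetch the pages, respect robots.txt, and handle malformed markup, which is why an existing tool is usually the better choice.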

143

answered Sep 24 '22 08:09

anjanb

Crawler4j is an open source Java crawler which provides a simple interface for crawling the Web. You can set up a multi-threaded web crawler in 5 minutes.

You can set your own filter to decide whether or not to visit a page (by URL) and define an operation to run on each crawled page according to your own logic.

Some reasons to select crawler4j:

Multi-threaded structure
You can set the depth to be crawled
It is Java-based and open source
Control for redundant links (URLs)
You can set the number of pages to be crawled
You can set the page size to be crawled
Sufficient documentation
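
To show how a few of these features fit together, here is a rough sketch of a crawl loop with a URL filter, a per-page operation, a depth limit, a page limit, and a check for already-seen URLs. It is written in Python for brevity and is not crawler4j's API: the function `crawl` and its parameters `fetch_links`, `should_visit`, and `visit` are hypothetical names chosen for this example, and the loop is single-threaded.

```python
from collections import deque


def crawl(seed, fetch_links, should_visit, visit, max_depth=2, max_pages=100):
    """Breadth-first crawl starting at `seed`.

    fetch_links(url)  -> list of URLs found on that page (supplied by the caller)
    should_visit(url) -> bool, the URL filter
    visit(url)        -> per-page operation, e.g. store the content for indexing
    """
    seen = {seed}                 # control for redundant links
    queue = deque([(seed, 0)])    # (url, depth) pairs still to process
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visit(url)
        visited.append(url)
        if depth >= max_depth:
            continue              # do not follow links beyond the depth limit
        for link in fetch_links(url):
            if link not in seen and should_visit(link):
                seen.add(link)
                queue.append((link, depth + 1))
    return visited
```

For example, passing `should_visit=lambda u: u.startswith("http://example.com/")` would keep the crawl on a single site, and `visit` is where the page content would be handed to the indexer.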

43

answered Sep 21 '22 08:09

cuneytykaya
