I just had this thought and was wondering if it's possible to crawl the entire web (just like the big boys!) on a single dedicated server (say a Core2Duo, 8 GB RAM, 750 GB disk, 100 Mbps).
I've come across a paper where this was done, but I cannot recall its title. It was something about crawling the entire web on a single dedicated server using some statistical model.
Anyway, imagine starting with around 10,000 seed URLs and doing an exhaustive crawl...
Is it possible?
I need to crawl the web but am limited to a dedicated server. How can I do this? Is there an open-source solution out there already?
For example, see this real-time search engine: http://crawlrapidshare.com. The results are extremely good and freshly updated. How are they doing this?
Crawling the Web is conceptually simple. Treat the Web as a very complicated directed graph. Each page is a node. Each link is a directed edge.
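To make the graph traversal concrete, here is a minimal sketch in Python, assuming the third-party `requests` and `beautifulsoup4` packages. The seed list and page limit are placeholders; a real crawler would also need robots.txt handling, politeness delays, and persistent storage.

```python
# Minimal breadth-first crawl: pages are nodes, <a href> links are edges.
from collections import deque
from urllib.parse import urljoin, urldefrag

import requests
from bs4 import BeautifulSoup

def crawl(seed_urls, max_pages=1000):
    frontier = deque(seed_urls)     # edges not yet followed
    visited = set(frontier)         # nodes already discovered
    while frontier and len(visited) <= max_pages:
        url = frontier.popleft()
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue                # dead link: skip this edge
        soup = BeautifulSoup(resp.text, "html.parser")
        for a in soup.find_all("a", href=True):
            # Resolve relative links and drop #fragments so the same
            # node isn't queued under several different spellings.
            link, _ = urldefrag(urljoin(url, a["href"]))
            if link.startswith("http") and link not in visited:
                visited.add(link)
                frontier.append(link)
    return visited
```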
You could start with the assumption that a single well-chosen starting point will eventually lead to every other point. This won't be strictly true, but in practice I think you'll find it's mostly true. Still, chances are you'll need multiple (maybe thousands of) starting points.
You will want to make sure you don't traverse the same page twice (within a single traversal). In practice the traversal will take so long that it's really a question of how long before you come back to a particular node, and also how you detect and deal with changes (meaning the second time you visit a page it may have changed).
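On the hardware in the question, the naive `visited` set above is the first thing that breaks: billions of full URL strings won't fit in 8 GB of RAM. One common workaround (my suggestion, not something from the answer above) is a Bloom filter: a fixed-size bit array plus k hash functions that answers "definitely new" or "probably seen", trading a small false-positive rate for a bounded memory footprint.

```python
# Sketch of a Bloom filter for URL dedup on limited RAM.
# Sizing is an assumption: 2**30 bits = 128 MiB; in practice you'd
# size it to the expected URL count and target false-positive rate.
import hashlib

class BloomFilter:
    def __init__(self, size_bits=2**30, num_hashes=7):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, url):
        # Derive k bit positions from one SHA-256 digest of the URL
        # (7 hashes x 4 bytes each fits in the 32-byte digest).
        digest = hashlib.sha256(url.encode()).digest()
        for i in range(self.num_hashes):
            chunk = digest[i * 4:(i + 1) * 4]
            yield int.from_bytes(chunk, "big") % self.size

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(url))

seen = BloomFilter()
if "http://example.com/" not in seen:
    seen.add("http://example.com/")
```

For the revisit problem (detecting changed pages), a bit set can't help; you'd keep a per-URL timestamp or content checksum on disk instead.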
The killer will be how much data you need to store and what you want to do with it once you've got it.
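A rough back-of-envelope calculation shows why. The average page size and compression ratio below are my assumptions, not figures from the question:

```python
# Back-of-envelope numbers for the hardware in the question.
disk_bytes = 750e9            # 750 GB disk
avg_page_bytes = 100e3        # ~100 KB of HTML per page (assumed)
compression = 5               # ~5:1 gzip on HTML (assumed)

pages_on_disk = disk_bytes / (avg_page_bytes / compression)
print(f"pages that fit on disk:  {pages_on_disk:,.0f}")   # ~37,500,000

link_bytes_per_day = 100e6 / 8 * 86400   # 100 Mbps, fully saturated
pages_per_day = link_bytes_per_day / avg_page_bytes
print(f"pages fetchable per day: {pages_per_day:,.0f}")   # ~10,800,000
```

So even saturating the link around the clock, you fetch on the order of ten million pages a day and can store a few tens of millions compressed, against a web of billions of pages. An "exhaustive" crawl on this box is really a prioritization and storage problem, not a CPU problem.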