
Architecture - How to efficiently crawl the web with 10,000 machines?

Let’s pretend I have a network of 10,000 machines. I want to use all of those machines to crawl the web as fast as possible. All pages should be downloaded only once. In addition, there must be no single point of failure, and we must minimize the amount of communication required between machines. How would you accomplish this?

Is there anything more efficient than using consistent hashing to distribute the load across all machines and minimize communication between them?
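To make the consistent-hashing approach mentioned above concrete, here is a minimal sketch (class, machine, and vnode names are purely illustrative, not from the post). Every machine can compute the owner of any URL locally with no coordination traffic, and removing a failed machine only remaps the keys that machine owned:

```python
# A minimal consistent-hashing sketch for assigning URLs to crawler machines.
import bisect
import hashlib


def h(key: str) -> int:
    """Map a string onto a 160-bit hash ring."""
    return int(hashlib.sha1(key.encode("utf-8")).hexdigest(), 16)


class CrawlRing:
    def __init__(self, machines, vnodes=64):
        # Virtual nodes smooth out the load across physical machines.
        self._ring = sorted(
            (h(f"{m}#{i}"), m) for m in machines for i in range(vnodes)
        )
        self._keys = [k for k, _ in self._ring]

    def owner(self, url: str) -> str:
        """Return the machine responsible for crawling this URL."""
        i = bisect.bisect(self._keys, h(url)) % len(self._ring)
        return self._ring[i][1]

    def remove(self, machine: str) -> None:
        """Drop a failed machine; only its keys move to their successors."""
        self._ring = [(k, m) for k, m in self._ring if m != machine]
        self._keys = [k for k, _ in self._ring]


ring = CrawlRing([f"machine-{n:05d}" for n in range(10_000)])
print(ring.owner("https://example.com/index.html"))
```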

Martin asked Nov 05 '22


1 Answer

  1. Use a distributed MapReduce system such as Hadoop to divide the work space.
  2. If you want to be clever, or are doing this in an academic context, then try a nonlinear dimension reduction.
  3. The simplest implementation would probably be to use a hash function on the namespace key, e.g. the domain name or URL. Use Chord to assign each machine a subset of the hash values to process (a minimal sketch follows below).
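Point 3 could be sketched as follows (the function names, the domain-as-key choice, and the fixed modulo partition are assumptions for illustration, not part of the answer): each machine hashes the domain of every extracted link, keeps and de-duplicates the URLs it owns, and batches the rest into per-machine outboxes, so each page is downloaded once and inter-machine chatter stays low. In a real deployment the fixed modulo would be replaced by a Chord-style ring lookup, as in the sketch under the question, so that a machine failure only remaps a small slice of the key space.

```python
# A minimal sketch of domain-hash partitioning for a distributed crawler.
import hashlib
from urllib.parse import urlparse

NUM_MACHINES = 10_000  # illustrative; a Chord ring would replace the modulo


def owner_of(url: str) -> int:
    """Hash the domain so every URL of a site lands on the same machine."""
    domain = urlparse(url).netloc
    digest = hashlib.sha1(domain.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_MACHINES


def route(extracted_urls, my_id, local_frontier, seen, outbox):
    """Decide locally what to crawl and what to forward to other machines."""
    for url in extracted_urls:
        target = owner_of(url)
        if target == my_id:
            if url not in seen:          # download each page only once
                seen.add(url)
                local_frontier.append(url)
        else:
            outbox.setdefault(target, []).append(url)  # batch and send later


# Example: machine 42 routing links it just extracted from a page.
outbox, frontier, seen = {}, [], set()
route(["https://example.com/a", "https://example.org/b"], 42, frontier, seen, outbox)
```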
Martin Spamer answered Nov 11 '22