Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Find Top 10 Most Frequent visited URl, data is stored across network

Source: Google Interview Question

Given a large network of computers, each keeping log files of visited urls, find the top ten most visited URLs.

Have many large <string (url) -> int (visits)> maps.

Calculate < string (url) -> int (sum of visits among all distributed maps), and get the top ten in the combined map.

Main constraint: The maps are too large to transmit over the network. Also can't use MapReduce directly.

I have now come across quite a few questions of this type, where processiong needs to be done over large Distributed systems. I cant think or find a suitable answer.

All I could think of is brute force, which in some or other way, violates the given constraint.

like image 301
Spandan Avatar asked Jul 29 '13 15:07

Spandan


1 Answers

It says you can't use map-reduce directly which is a hint the author of the question wants you to think how map reduce works, so we will just mimic the actions of map-reduce:

  1. pre-processing: let R be the number of servers in cluster, give each server unique id from 0,1,2,...,R-1
  2. (map) For each (string,id) - send the tuple to the server which has the id hash(string) % R.
  3. (reduce) Once step 2 is done (simple control communication), produce the (string,count) of the top 10 strings per server. Note that the tuples where those sent in step2 to this particular server.
  4. (map) Each server will send all his top 10 to 1 server (let it be server 0). It should be fine, there are only 10*R of those records.
  5. (reduce) Server 0 will yield the top 10 across the network.

Notes:

  • The problem with the algorithm, like most big-data algorithms that don't use frameworks is handling failing servers. MapReduce takes care of it for you.
  • The above algorithm can be translated to a 2 phases map-reduce algorithm pretty straight forward.
like image 194
amit Avatar answered Oct 19 '22 15:10

amit