 

design a system supporting massive data storage and query

I was asked by the interviewer to design a system to store gigabytes of data and the system also has to support some kind of query.

Description:

A massive number of records are generated in an IDC; each record is composed of a url, the IP which visited the url, and the time when the visit occurred. A record can probably be stated as a struct like this, but I'm not sure which data types I should pick to represent the fields:

#include <time.h>

struct Record {
    char   *url;          /* char *?                     */
    int     IP;           /* int?                        */
    time_t  visit_time;   /* time_t, or simply a number? */
};

Requirements:

Design a system to store 100 billion records, and the system also has to support at least 2 kinds of queries:

First, given a time period (t1, t2) and an IP, query how many urls this IP has visited in the given period.

Second, given a time period (t1, t2) and a url, query how many times this url has been visited.

I was stumped, and here is my naive solution:

Analysis:

Because every query is performed over a given period of time:

1. Create a set, put all visit times into the set, and keep the set ordered by time value from oldest to latest.

2. Create a hash table using hash(visit_time) as the key (call it the time-hash-table); each node in a bucket has 2 pointers, each pointing to another hash table.

3. The other 2 hash tables are an ip-hash-table and a url-hash-table:

The ip-hash-table uses hash(ip) as the key, and all the IPs in the same ip-hash-table share the same visit time;

the url-hash-table uses hash(url) as the key, and all the urls in the same url-hash-table share the same visit time.

Here is a drawing of the structure:

time_hastbl
  []
  []
  []-->[visit_time_i]-->[visit_time_j]...[visit_time_p]-->NIL
  []                     |          |
  []               ip_hastbl       url_hastbl
                      []               []
                      :                :
                      []               []
                      []               []

So, when performing a query over (t1, t2):

  1. Find the closest match in the time set; let's say the match is (t1', t2'). Then all the valid visit times fall into the part of the set from t1' to t2'.

  2. For each visit time t in the time set [t1':t2'], compute hash(t) to find t's ip_hastbl or url_hastbl, then count how many times the given IP or url appears.
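To make the structure above concrete, here is a minimal single-machine sketch in C, assuming the layout described in the steps above; the per-time "hash tables" are reduced to flat arrays with linear search purely to keep it short, and all names are illustrative rather than a real implementation:

#include <string.h>
#include <time.h>
#include <stddef.h>

struct ip_count  { unsigned ip;   long count; };   /* entry of an "ip_hastbl"  */
struct url_count { char url[256]; long count; };   /* entry of a "url_hastbl"  */

struct time_node {
    time_t            visit_time;  /* one distinct visit time */
    struct ip_count  *ips;         /* per-time IP counts      */
    size_t            n_ips;
    struct url_count *urls;        /* per-time url counts     */
    size_t            n_urls;
};

/* The "ordered set" of visit times, kept sorted by visit_time. */
struct time_index { struct time_node *nodes; size_t n; };

/* Query 2: how many times was `url` visited in [t1, t2]? */
long visits_of_url(const struct time_index *idx, time_t t1, time_t t2,
                   const char *url)
{
    long total = 0;
    for (size_t i = 0; i < idx->n; i++) {
        if (idx->nodes[i].visit_time < t1) continue;
        if (idx->nodes[i].visit_time > t2) break;   /* index is sorted */
        for (size_t j = 0; j < idx->nodes[i].n_urls; j++)
            if (strcmp(idx->nodes[i].urls[j].url, url) == 0)
                total += idx->nodes[i].urls[j].count;
    }
    return total;
}

A query over (t1, t2) walks the sorted time index between the two bounds and sums the per-time counts, which is exactly where this design gets expensive when the range contains many distinct visit times.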

Questions:

1. My solution is naive; I hope you can suggest a better one.

2. With respect to storing the massive number of records on disk, any advice? I thought of a B-tree, but how would I use it, and is a B-tree applicable to this system?

asked Sep 15 '11 by Alcott

3 Answers

I believe the interviewer was expecting a distributed-computing-based solution, especially when "100 billion records" are involved. With the limited knowledge of distributed computing I have, I would suggest you look into Distributed Hash Tables and map-reduce (for parallel query processing).
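To make the map-reduce suggestion concrete, here is a minimal single-process sketch in C (not a real framework); the record layout, shard count, and function names are assumptions made up for illustration. Each shard would hold a partition of the 100 billion records and run the "map" step over its own partition, and a coordinator would "reduce" the partial counts into the final answer:

#include <stddef.h>
#include <time.h>

#define N_SHARDS 4

struct record { const char *url; unsigned ip; time_t visit_time; };

/* "map" step: each shard scans only its own partition of the records and
   emits a partial count for the query (ip, [t1, t2]). */
static long map_count_visits(const struct record *part, size_t n,
                             unsigned ip, time_t t1, time_t t2)
{
    long partial = 0;
    for (size_t i = 0; i < n; i++)
        if (part[i].ip == ip &&
            part[i].visit_time >= t1 && part[i].visit_time <= t2)
            partial++;     /* counting visits; counting *distinct* urls
                              would need a per-shard set plus a merge   */
    return partial;
}

/* "reduce" step: the coordinator sums the partial counts from all shards. */
static long reduce_sum(const long partials[N_SHARDS])
{
    long total = 0;
    for (int k = 0; k < N_SHARDS; k++)
        total += partials[k];
    return total;
}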

answered Nov 14 '22 by vine'th


In my opinion, create a B+ tree keyed on time to help you quickly locate the range of records on disk for a given time period (t1, t2). Then use the records in (t1, t2) to build an IP hash table and a URL hash table respectively.
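A minimal sketch of this idea, with the B+ tree's time-ordered leaf scan simulated by an in-memory array of records sorted by visit_time; the record layout and function names are assumptions for illustration, and where a real system would range-scan the tree on disk and then build the hash tables over the matching records, this sketch simply counts the visits of one url in a single pass:

#include <string.h>
#include <time.h>
#include <stddef.h>

struct record { char url[256]; unsigned ip; time_t visit_time; };

/* First index whose visit_time >= t (records sorted by visit_time),
   standing in for descending the B+ tree to the start of the range. */
static size_t lower_bound(const struct record *recs, size_t n, time_t t)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (recs[mid].visit_time < t) lo = mid + 1;
        else                          hi = mid;
    }
    return lo;
}

/* Query 2: how many times was `url` visited in [t1, t2]? */
long count_url_visits(const struct record *recs, size_t n,
                      time_t t1, time_t t2, const char *url)
{
    long count = 0;
    for (size_t i = lower_bound(recs, n, t1);
         i < n && recs[i].visit_time <= t2; i++)
        if (strcmp(recs[i].url, url) == 0)
            count++;
    return count;
}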

answered Nov 14 '22 by Steven


Old question, but recently bumped, so here are a few other things to think about:

What you need to consider are a few very simple boundary limits beyond your listed requirements, assuming you have no further indexes:

First, given a time period (t1, t2) and an IP, query how many urls this IP has visited in the given period.

If you have 10k users, then at worst you can expect a scan of all records in a time window to result in only 1 in every 10k records accessed actually needing to be returned (on average).

Second, given a time period (t1, t2) and a url, query how many times this url has been visited.

Depending on how many urls you have in the system, say 1000, this again means that a simple scan results in 999 of every 1000 records scanned not being returned.

Let's say you have only 100,000 unique urls; you could greatly reduce the space consumed by the database (by storing a guid / int foreign key instead of the url text), and it also means the average url is accessed 1M times across your 100Bn records.

Even with all this, it tells us nothing definitive, because we don't have numbers / statistics on how clustered by time the records are for the given search periods. Are we getting 1000 page requests every second and searching over a 12-month range, or are we getting 100 requests per second and searching over a 1-hour block (360k requests)?

Assuming the 100Bn represents 12 months of data, that's roughly 3170 requests per second. Does that sound reasonable?
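As a quick back-of-envelope check of that figure (assuming the 100Bn records are spread evenly over one 365-day year):

#include <stdio.h>

int main(void)
{
    double records          = 100e9;
    double seconds_per_year = 365.0 * 24 * 3600;   /* 31,536,000 */
    printf("%.0f records per second\n",
           records / seconds_per_year);            /* ~3171 */
    return 0;
}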

Why is this important? Because it highlights one key thing you overlooked in your answer.

With 100Bn records from the past 12 months, in another 12 months you'll have 200Bn records to deal with. If the 100Bn records span 20 years then it's not such an issue; you can expect to grow by only another 25-30Bn over the next 5 years... but it's unlikely that your existing data covers such a long time frame.

Your solution only answers one side of the equation (reading data); you don't consider any complications with writing that much data. The vast majority of the time you will be inserting data into whatever data store you create; will it be able to handle a constant 3k insert requests per second?

If you insert 3k records per second and each record is just 3x 64-bit integers representing the time (in ticks), the IP address and a foreign key to the url, then that is only ~72 KB/s of write traffic, which is easy to sustain. If every URL is assumed unique and stored in full, on the other hand, you could easily run into performance issues due to IO speeds (ignoring the space requirements).
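A sketch of that compact record layout and the resulting write rate; the field names and the url dictionary the foreign key points into are assumptions for illustration:

#include <stdint.h>
#include <stdio.h>

struct packed_record {
    int64_t  visit_ticks;   /* time of the visit, in ticks             */
    uint64_t ip;            /* IPv4 fits easily; IPv6 needs 128 bits   */
    uint64_t url_id;        /* foreign key into a url dictionary       */
};                          /* 24 bytes per record, no padding         */

int main(void)
{
    double bytes_per_sec = 3000.0 * sizeof(struct packed_record);  /* 72,000 */
    printf("~%.0f KB/s of raw record data at 3k inserts/s\n",
           bytes_per_sec / 1000.0);                                /* ~72 KB/s */
    return 0;
}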

One other thing the interviewer would be interested in seeing is your thoughts on supporting IPv6.

Lastly, if you provided a solution like the one you have, the interviewer should have asked a follow-up question: "How would your system perform if I now want to know when a specific IP address last accessed a specific url?"

So yes, if you don't know about MapReduce and other distributed query processing systems, then yours should be a reasonable answer.

answered Nov 14 '22 by Seph