Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

design a datastructure to return the number of connections to a web server in last 1 minute [duplicate]

Asked this on cs.stackexchange.. got a downvote.. because i was not very clear.. So I will try to be more specific here..

Q - design a datastructure to return the number of connections to a web server in last 1 minute..

Assumptions -

  1. The server has a heavy incoming number of connection.. like Indian railways reservation OR a social nw site etc..
  2. Suppose if this is a big data problem .. then I have the infra to run a big data job ..

I am looking for:

  1. Efficiency - Is it possible to do this in O(1)? If, say, we do it in O(n).. the problem is that if it takes N milliseconds to compute the answer.. there are some more connections that have been enqueued in that N ms.. How should I tackle this.. or I can only ignore the small delay and O(n) is an OK performance

  2. Reasoning/Approach - Do we do something similar to this in any myriads of deployments we have in production? Is there a similar problem..?

  3. Is this "Big data"? Is the data for storing connections in last N (N is of order 10) minutes of the order a Big Data problem?

My Effort: I know that -

  1. connections to a web server are put in a queue before being served by a thread
  2. each connection has a timestamp

Approach -

  1. As soon as a connection is put in the Queue, write it to a file.. (at least its timestamp and a handle/unique identifier to the connection)
  2. As soon as client requests "give me num connections in last 1 minute" .. process the file to find out the number of connections .. we know current time in millisecs and that we have to find connections the timestamp of which fall in currentTime - 60 secs
  3. This job can be run using map reduce .. I also know that the file has sorted data (by timestamp)..

I also run a daemon that removes entries/files older than 10 mins.. so that i dont store unwanted data

like image 778
abipc Avatar asked Aug 23 '13 06:08

abipc


1 Answers

Queue approach

  • Use a queue.

  • Whenever a new request comes it, enqueue it.

  • Whenever we need to get the number of requests in the last minute, first dequeue all entries that occurred longer than a minute ago (this works because the entries in the queue were inserted in order, thus the queue is sorted). Then return the size of the queue (size can be returned in O(1) by having a variable storing this).

  • Since you say there are lots of requests per second, you can also dequeues old entries whenever you get a new request, which should keep the queue close to the right size, thus minimizing the time taken for the count operation.

    This will be O(k) for both enqueue and get count, where k is the maximum number of requests that occurred in any time period of length t, where t is the maximum duration between any two requests (as an example, if we get requests at the following milliseconds: 1,2,3,5,15,20, the maximum duration is 20-15 = 5, and the maximum number of requests occurring in 5 millisecond is 4 (i.e. 1,2,3,5 occurs in the period 1-5)).

  • An alternative is having a timer running which dequeues old entries.

    The performance of this will be O(1) for enqueue and O(m) for the timer run and get count, where m is the maximum number of requests for any period of length equal to the timer frequency. As an example, use 1,2,3,5,15,20 again, with a timer interval of 6 milliseconds. The maximum number of requests occurring in 6 millisecond would be 4 (i.e. 1,2,3,5 occurs in the period 1-6).

Is this "Big data"?

Well, that really depends on the number of requests you get per second. We typically talk about many terrabyte of data for big data (and increases as computers get more powerful), and I don't think only requests in the last minute would get close to those numbers.

Count-based approach

If you're happy with a small error, you can do the following:

Have a queue of fixed size (let's say 60, with each element indicating the requests for a given second - being a simple integer value). This can be implemented as a circular array.

Let count = a variable indicating the number of requests for the last second, initialized to 0.

Have a variable storing the second value of the last request (to be able to determine which second an element in the queue is for).

When we get a new request, dequeue all elements indicating seconds longer than a minute ago, decrementing count by the dequeued value, and increment the back of the queue (the last inserted value) along with count.

When we need to get the number of requests the last second, dequeue all elements indicating seconds longer than a minute ago, decrementing count by the dequeued value, and return count.

Given that we have the queue is fixed-size, each operation will take at most O(1) (or, if you'd like, O(m) where m is the size of the queue).

This will give you an error of at most 1 second. You can of course play around with the error. For example, if you want a half-a-second error, simply double the size of the queue and go to the next element every half-a-second.

like image 74
Bernhard Barker Avatar answered Oct 06 '22 00:10

Bernhard Barker