I asked this on cs.stackexchange and got a downvote because I was not very clear, so I will try to be more specific here.
Q - Design a data structure to return the number of connections to a web server in the last 1 minute.
Assumptions -
I am looking for:
Efficiency - Is it possible to do this in O(1)? If, say, we do it in O(n), the problem is that if it takes N milliseconds to compute the answer, more connections will have been enqueued during those N ms. How should I tackle this, or can I just ignore the small delay and accept O(n) as adequate performance?
Reasoning/Approach - Do we do something similar to this in any of the myriad deployments we have in production? Is there a similar problem?
Is this "Big data"? Is the data for storing connections in last N (N is of order 10) minutes of the order a Big Data problem?
My Effort: I know that -
Approach -
I also run a daemon that removes entries/files older than 10 minutes, so that I don't store unwanted data.
Queue approach
Use a queue.
Whenever a new request comes in, enqueue it.
Whenever we need to get the number of requests in the last minute, first dequeue all entries that occurred longer than a minute ago (this works because the entries in the queue were inserted in order, so the queue is sorted). Then return the size of the queue (the size can be returned in O(1) by keeping it in a variable).
Since you say there are lots of requests per second, you can also dequeue old entries whenever you get a new request, which should keep the queue close to the right size, thus minimizing the time taken for the count operation.
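A minimal sketch of this queue approach in Python, assuming timestamps from time.monotonic() and collections.deque as the queue (the class and method names are my own):

```python
import time
from collections import deque

class ConnectionCounter:
    """Sketch: counts connections seen in the last `window` seconds."""

    def __init__(self, window=60):
        self.window = window
        self.timestamps = deque()  # arrival times, in increasing order

    def _evict(self, now):
        # Entries were enqueued in order, so the front is always the oldest.
        while self.timestamps and now - self.timestamps[0] > self.window:
            self.timestamps.popleft()

    def record_request(self):
        now = time.monotonic()
        self._evict(now)               # keep the queue close to the right size
        self.timestamps.append(now)

    def count_last_minute(self):
        self._evict(time.monotonic())
        return len(self.timestamps)    # len() on a deque is O(1)
```

Evicting on every request amortizes the cleanup cost, which is the idea behind the O(k) bound discussed next.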
This will be O(k) for both enqueue and get count, where k is the maximum number of requests that occurred in any time period of length t, and t is the maximum duration between any two consecutive requests. As an example, if we get requests at the following milliseconds: 1, 2, 3, 5, 15, 20, the maximum gap is 15 - 5 = 10, and the maximum number of requests occurring in any 10 ms window is 4 (i.e. 1, 2, 3, 5 occur in the period 1-10).
An alternative is having a timer running which dequeues old entries.
The performance of this will be O(1) for enqueue and O(m) for the timer run and get count, where m is the maximum number of requests in any period of length equal to the timer interval. As an example, take 1, 2, 3, 5, 15, 20 again, with a timer interval of 6 milliseconds. The maximum number of requests occurring in any 6 ms window would be 4 (i.e. 1, 2, 3, 5 occur in the period 1-6).
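A sketch of this timer variant, assuming a background daemon thread that sweeps the queue at a fixed interval (the interval, thread setup and names are my own choices):

```python
import threading
import time
from collections import deque

class TimedConnectionCounter:
    """Sketch: a background timer periodically evicts entries older than the window."""

    def __init__(self, window=60, sweep_interval=1.0):
        self.window = window
        self.timestamps = deque()
        self.lock = threading.Lock()
        # Daemon thread so the sweeper does not keep the process alive.
        threading.Thread(target=self._sweep_forever, args=(sweep_interval,),
                         daemon=True).start()

    def _evict(self, now):
        while self.timestamps and now - self.timestamps[0] > self.window:
            self.timestamps.popleft()

    def _sweep_forever(self, interval):
        while True:
            time.sleep(interval)
            with self.lock:
                self._evict(time.monotonic())

    def record_request(self):
        with self.lock:                      # O(1) enqueue
            self.timestamps.append(time.monotonic())

    def count_last_minute(self):
        with self.lock:
            # Evict here too, so the answer is exact rather than only as
            # fresh as the last sweep.
            self._evict(time.monotonic())
            return len(self.timestamps)
```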
Is this "Big data"?
Well, that really depends on the number of requests you get per second. We typically talk about many terabytes of data for big data (a threshold that keeps rising as computers get more powerful), and I don't think just the requests from the last minute would get close to those numbers.
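For a rough sense of scale, with made-up but plausible numbers: even at 10,000 requests per second, storing one 8-byte timestamp per request over a 10-minute window is about 10,000 × 600 × 8 bytes ≈ 48 MB, which is nowhere near big-data territory.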
Count-based approach
If you're happy with a small error, you can do the following:
Have a queue of fixed size (let's say 60), with each element being a simple integer holding the number of requests for a given second. This can be implemented as a circular array.
Let count be a variable holding the number of requests in the last minute, initialized to 0.
Have a variable storing the second of the last request (to be able to determine which second an element in the queue is for).
When we get a new request, dequeue all elements representing seconds longer than a minute ago, decrementing count by each dequeued value, then increment the back of the queue (the last inserted value) along with count.
When we need to get the number of requests in the last minute, dequeue all elements representing seconds longer than a minute ago, decrementing count by each dequeued value, and return count.
Given that the queue is fixed-size, each operation will take at most O(1) (or, if you'd like, O(m), where m is the size of the queue).
This will give you an error of at most 1 second. You can of course play around with the error; for example, if you want a half-second error, simply double the size of the queue and advance to the next element every half second.
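A minimal Python sketch of this count-based approach, using a fixed circular array of per-second buckets keyed by int(time.monotonic()) (the bucketing choice and all names are my own):

```python
import time

class BucketedConnectionCounter:
    """Sketch: approximate last-minute count using 60 one-second buckets."""

    def __init__(self, buckets=60):
        self.buckets = [0] * buckets        # circular array of per-second counts
        self.count = 0                      # running total across all buckets
        self.last_second = int(time.monotonic())

    def _advance(self, now_second):
        # Zero every bucket for a second that has passed since the last call,
        # subtracting its contents from the running total (capped at one lap).
        elapsed = min(now_second - self.last_second, len(self.buckets))
        for i in range(1, elapsed + 1):
            idx = (self.last_second + i) % len(self.buckets)
            self.count -= self.buckets[idx]
            self.buckets[idx] = 0
        self.last_second = now_second

    def record_request(self):
        now_second = int(time.monotonic())
        self._advance(now_second)
        self.buckets[now_second % len(self.buckets)] += 1
        self.count += 1

    def count_last_minute(self):
        self._advance(int(time.monotonic()))
        return self.count
```

Because the number of buckets is fixed, _advance touches at most 60 slots, which matches the O(1)/O(m) bound above, and the reported count is off by at most one second's worth of requests.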