I faced this problem in a recent interview:
You have a stream of incoming numbers in range 0 to 60000
and you have a function which will take a number from that range and return the count of occurrence of that number till that moment. Give a suitable Data structure/algorithm to implement this system.
My solution is:
Make an array of size 60001 pointing to bit-vectors. These bit vectors will contain the count of the incoming numbers and the incoming numbers will also be used to index into the array for the corresponding number. Bit-vectors will dynamically increase as the count gets too big to hold in them.
So, if the numbers are coming at rate 100numbers/sec
then, in 1million years total numbers will be = (100*3600*24)*365*1000000 = 3.2*10^15
. In the worst case where all numbers in the stream is same it will take ceil((log(3.2*10^15) / log 2) )= 52bits
and if the numbers are uniformly distributed the we will have (3.2*10^15) / 60001 = 5.33*10^10
number of occurrences for each number which will require total of 36 bits
for each numbers.
So, assuming 4byte pointers we need (60001 * 4)/1024 = 234 KB
memory for the array and for the case with same numbers, we need bit vector size = 52/8 = 7.5 bytes
which is still around 234KB. And for the other case we need (60001 * 36 / 8)/1024 = 263.7 KB
for bit vector totaling about 500KB. So, it is very much feasible to do this with ordinary PC and memory.
But the interviewer said, as it is infinite stream it will eventually overflow and gave me hint like how can we do this if there were many PCs and we could pass messages between them or think about file system etc. But I kept thinking if this solution was not working then, others would too. Needless to say, I did not get the job.
How to do this problem with less memory? Can you think of an alternative approach (using network of PCs may be)?
A formal model for the problem could be the following.
We want to know if it exists a constant space bounded Turing machine such that, in any given time it recognizes the language L of all couples (number,number of occurrences so far). This means that all correct couples will be accepted and all incorrect couples will be rejected.
As a corollary of the Theorem 3.13 in Hopcroft-Ullman we know that every language recognized by a constant space bounded machine is regular.
It can be proven by using the pumping lemma for regular languages that the language described above is not a regular language. So you can't recognize it with a constant space bounded machine.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With