 

Design a storage algorithm [closed]

Here's a question for interview crackers:

Given that you are receiving samples from an instrument at a constant rate, and you have constant storage space, how would you design a storage algorithm that would allow me to get a representative readout of data, no matter when I looked at it? In other words, representative of the behavior of the system to date.

I couldn't come up with any approach, so I am looking for ideas.

asked Oct 04 '12 by halkujabra


2 Answers

Assume that you have memory to store k elements. Store the first k elements in an array. Now, when you receive the nth element (where n > k), generate a random number r between 1 and n. If r > k, discard the nth element; otherwise, replace the rth element in the array with the nth element.

This approach ensures that at any stage your array contains k elements that form a uniformly random sample of the input received so far.
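In case it helps to see the scheme in code, here is a minimal Python sketch of the replacement rule just described (the function name and the use of random.randint are my own choices):

    import random

    def reservoir_sample(stream, k):
        """Keep a uniform random sample of size k from a stream of unknown length.
        A sketch of the replacement rule described above; the name is mine."""
        reservoir = []
        for n, sample in enumerate(stream, start=1):
            if n <= k:
                # Store the first k elements directly.
                reservoir.append(sample)
            else:
                # Pick r uniformly from 1..n; keep the new element only if r <= k,
                # in which case it overwrites slot r.
                r = random.randint(1, n)
                if r <= k:
                    reservoir[r - 1] = sample
        return reservoir

    # Example: a "stream" of 10,000 instrument readings, constant storage of 100.
    print(reservoir_sample(range(10_000), 100))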

Proof: We can show by induction that, at any stage, the k stored elements are a uniformly random selection from the elements received so far. After the first k elements, each is in the array with probability 1 = k/k. Assume that after receiving n-1 elements, any given element is present in the array with probability k/(n-1).

After receiving the nth element, the probability that this new element is inserted into the array is k/n (the probability that r <= k).

For any other element, the probability that it is represented in the current iteration = probability that it is represented in the previous iteration * probability that it is not replaced in the current iteration. (The nth element is kept with probability k/n and, if kept, lands in any particular slot with probability 1/k, so a stored element survives the current iteration with probability 1 - (k/n)*(1/k) = (n-1)/n.)

= (k/(n-1)) * ((n-1)/n) = k/n.
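The induction can also be checked empirically. A small self-contained simulation (with arbitrary parameters k=5, n=50, repeating the sampler sketched above) shows each element ending up in the array in roughly k/n of the runs:

    import random
    from collections import Counter

    def reservoir_sample(stream, k):
        reservoir = []
        for n, sample in enumerate(stream, start=1):
            if n <= k:
                reservoir.append(sample)
            else:
                r = random.randint(1, n)
                if r <= k:
                    reservoir[r - 1] = sample
        return reservoir

    # Run many independent trials and count how often each element survives.
    k, n, trials = 5, 50, 20_000
    counts = Counter()
    for _ in range(trials):
        counts.update(reservoir_sample(range(n), k))

    # Each element should appear in roughly k/n = 10% of the trials.
    for element in (0, 10, 25, 49):
        print(element, counts[element] / trials)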
answered Nov 06 '22 by krjampani


First, credit where it belongs. I'll elaborate on, rather than replace, the approach of krjampani: really nice and clear.

There is one point left to investigate - not so minor - and from it we'll come to a related, somewhat hidden, issue in the problem.

Let's first note that we can reformulate the result, viewing it from another angle if you will: at any given moment, the k stored points are (let's say: expected to be) uniformly distributed over the full range [1-n] of samples received so far, from which the stated result follows: of the samples from any given period, the fraction that survives in the fixed storage [1-k] is k/n - presumably the optimal way to be representative.

We should realize that all of this is "statistical": we generate random numbers to control the replacement of older data by newer data in the storage. The stated results, therefore, are not exact results but (statistical) "expected values".

A statistical "expectation value", however, is of course seldom what we actually get: it is merely an average over a conceptually infinite number of trying the same thing again. Whether or not the actual distribution of the data from some "period in time" over the interval [1-n] and the relevant derived value of their relative count in [1-k], is likely to be close to the (perfect) expectation value depends, in this case, on the way we generate the random numbers (between 1 and n). If that is truly random, we will be doing a Monte-Carlo-type sampling, resulting in a Gaussian-type distribution of outcomes, i.e. actual point-distributions if we would do the same thing over and over again, around the uniform distribution that we aimed for. As a corollary, until we have a very large number of points, the statistical spread will remain rather large, implying that, although the "expectation value" of our point-distribution is perfect (i.e. as we targeted), the likelyhood that in the one-shot reality we actually have something close to that distribution is not so large.

A little thinking makes it obvious that there is no way to have, after each addition, a perfectly uniform distribution at all times, no matter how we decide on replacing older points by newer ones. Our aim, therefore, must be to decrease as much as possible the EXPECTED DEVIATION from it.

The issue, reformulated, is: given an interval, you are required to place points on it, ever more without limit, such that their distribution is at all times "as close as possible" to uniform. The way to do this is to adopt a fixed "step" from each point to the next, where the step size is relatively prime to the interval length - preferably with both being large primes. Example with small numbers: the interval is 11 (in some unit; the "real" values may well be reals, not integers) and the step length is taken as 5; the points are (k*5) mod 11: 0, 5, 10, 4, 9, 3, 8, ... In our case we have the additional complication that the interval keeps growing. We may need to adapt the point generation, for instance (I'm not sure) by arranging that any new point is first placed where it would have gone with the original parameters (size, step) fixed, and then has its location scaled up with the actual interval length: interval again 11, increasing by 1 each time, and step 5; points: 0, 5*(12/11), 10*(13/11), etcetera. And since we need integer "slots" to replace (or not) a stored value, we'll have to round to the nearest integer (the consequences of that rounding on the statistics may require a further adjustment of the point generator). I have nothing more here; there's still some stuff to be worked out.
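A small sketch of the fixed-step placement, reproducing the small-number example (interval 11, step 5) and the tentative rescaling for a growing interval; the adaptation part is only my rendering of the unfinished suggestion above:

    # Fixed-step placement on an interval of length 11 with step 5 (coprime to 11).
    interval, step = 11, 5
    print([(i * step) % interval for i in range(8)])   # 0, 5, 10, 4, 9, 3, 8, 2

    # Tentative adaptation to a growing interval (my rendering of the suggestion
    # above): place each point as if the original parameters still applied, then
    # rescale to the current length and round to the nearest integer slot.
    length = interval
    slots = []
    for i in range(8):
        raw = (i * step) % interval                    # position under the original parameters
        slots.append(round(raw * length / interval))   # rescale to the current length
        length += 1                                    # the interval grows by 1 each time
    print(slots)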

I come to the final - hidden - issue: in all of the above, we've tacitly assumed that uniform sampling - distributing points equally over an interval - is the best way to get a representative result. Presumably we can interpret "representative result" as - say, looking at one particular measured quantity - a fair average of its values over a specific period of time. Picturing the measured quantity as a function of time, we are actually after the INTEGRAL of that function over the time-interval (divided by the interval-length).

Now, unless that function behaves completely wildly over time, jumping up and down and doing all kinds of fancy stuff - in which case all bets are off and you may as well do anything random - there are theoretically and practically established prescriptions for how to sample a ("normally behaving") function over an interval to get an "optimal" approximation of its integral. Random (Monte-Carlo) sampling is truly bad (converging as 1/sqrt(N) with the number of sample points N); uniform sampling is much better (1/N), and in some special cases spectacularly optimal; but both are usually dwarfed by sampling at the zeros of specific polynomials, where - barring pathological cases - you typically gain anywhere between 0.5 and several SIGNIFICANT DIGITS of precision with each addition of only ONE MORE point.
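To make those convergence rates concrete, here is a small sketch comparing the three approaches on an arbitrary smooth test function (exp(x) and N=8 are my own choices; numpy's leggauss supplies the Gauss-Legendre nodes and weights):

    import math
    import random
    import numpy as np

    # Integrate a smooth test function over [-1, 1]; exact value for comparison.
    f = lambda x: math.exp(x)
    exact = math.e - 1 / math.e

    N = 8  # number of sample points for each method

    # Monte-Carlo: random points; the error shrinks only like 1/sqrt(N).
    mc = 2 * sum(f(random.uniform(-1, 1)) for _ in range(N)) / N

    # Uniform sampling: equidistant points (a plain Riemann sum), error ~ 1/N.
    uniform = 2 * sum(f(-1 + i * 2 / N) for i in range(N)) / N

    # Gauss-Legendre: sample at the zeros of the degree-N Legendre polynomial.
    x, w = np.polynomial.legendre.leggauss(N)
    gauss = sum(wi * f(xi) for xi, wi in zip(x, w))

    for name, value in (("Monte-Carlo", mc), ("uniform", uniform), ("Gauss-Legendre", gauss)):
        print(f"{name:>15}: error = {abs(value - exact):.2e}")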

With the above as viewpoint, we face our original problem as follows: how can we systematically generate points on a steadily increasing interval, such that at all times the distribution of the points over the interval is as close as possible to the distribution of the zeros of those specific polynomials (which ones depends on what you know about the function you wish to sample; in absence of any specific information: Legendre) over that interval (normalized to [-1:1])?

My (theoretical) approach, then, would be to use a "constant-step-over-relatively-prime-interval" method, where - in addition to an adjustment for the fact that the interval increases, see above - the measurement of length along the interval, used for "calculating" the step, is weighted by the distribution (function) of the zeros of the (Legendre) polynomials.
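For what it's worth, here is one possible (and heavily simplified) reading of that idea in code; this is my interpretation, not a worked-out method from the answer. It generates a fixed-step low-discrepancy sequence on [0,1) - with an irrational (golden-ratio) step in place of the integer coprime step, a common way to get the same effect - and warps it through the asymptotic distribution of Legendre-polynomial zeros (the arcsine density on [-1,1]), so the sample points cluster near the ends of the interval the way Gauss-Legendre nodes do. The function name and all parameters are illustrative assumptions.

    import math

    def warped_points(count, step=0.6180339887):
        # Fixed-step sequence on [0, 1) -- an irrational (golden-ratio) step is used
        # here in place of the integer coprime step -- warped so the points follow
        # the arcsine density, the asymptotic distribution of Legendre-polynomial
        # zeros on [-1, 1].  Both the step value and the warp are illustrative.
        points = []
        u = 0.0
        for _ in range(count):
            points.append(-math.cos(math.pi * u))   # inverse CDF of the arcsine law
            u = (u + step) % 1.0
        return points

    # Rescale to the current (growing) time interval [0, t] when a readout is taken.
    t = 100.0
    print([round((x + 1) / 2 * t, 1) for x in warped_points(8)])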

answered Nov 06 '22 by Bert te Velde