There is a stream of numbers coming in. At any point in time I might need a random 10% sample of the numbers seen so far. I obviously don't want to store the entire stream.

The bigger problem, which motivates the question above, is this: I have a lot of timestamp-based data that goes into a database. I also want to build a sample table that contains roughly 10% of the records in the main table, chosen at random, so that when a little inaccuracy is acceptable I can query the sample quickly. I receive the messages (numbers) in batches of varying size: sometimes 100, sometimes 20, sometimes 5, etc.

I was thinking I would do the sampling while streaming, which is what the problem above describes. Can someone suggest a good algorithm for this? Is there a better way?
The easy solution is to just save every 10th incoming data point, but this systematic approach can produce biased results if the stream has any periodic structure, depending on just how random the data is.
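A minimal sketch of that naive approach (the function name is mine, not from the question):

```python
def every_tenth(stream):
    """Systematic sampling: keep the items at positions 0, 10, 20, ...

    Fast and simple, but biased whenever the stream has structure
    whose period correlates with the sampling interval.
    """
    for i, item in enumerate(stream):
        if i % 10 == 0:
            yield item
```

For example, if the data arrives in repeating groups of ten where the first element of each group is special, this sampler would keep only those special elements.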
If you want to simulate a truly random 10% sample of an incoming stream, you can draw the number of entries to skip before logging the next one from a geometric distribution with success probability 0.1 (a mean skip of 9). This is exactly equivalent to keeping each entry independently with probability 10%; a Poisson distribution with mean 9 gives similar gap lengths. It's probably a good idea to set a cap, though, so you don't wind up with rare yet predictably large gaps in the data.
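A sketch of this gap-sampling idea, using only the standard library (function names and the cap value are illustrative assumptions, not from the answer):

```python
import math
import random

def geometric_gap(p):
    """Draw the number of entries skipped before the next kept one.

    Inverse-transform sampling of the geometric distribution:
    P(skip = k) = (1 - p)**k * p for k = 0, 1, 2, ...
    """
    u = random.random()
    return int(math.log(u) / math.log(1.0 - p))

def gap_sample(stream, p=0.1, max_gap=50):
    """Yield roughly a fraction p of the items in `stream`.

    Instead of flipping a coin per item, draw how many items to skip
    before each kept one. Without the cap this is exactly equivalent
    to keeping each item independently with probability p; `max_gap`
    bounds the skip length, trading a slight bias for a guarantee
    that no gap exceeds max_gap + 1 items.
    """
    skip = min(geometric_gap(p), max_gap)
    for item in stream:
        if skip == 0:
            yield item
            skip = min(geometric_gap(p), max_gap)
        else:
            skip -= 1
```

Per kept item this costs one random draw and one log, rather than one draw per stream item, which matters when the sampling fraction is small.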