What's the fastest way to get an approximate count of the number of rows in an input file or a stdout data stream? To be clear, I'm looking for a probabilistic algorithm; I can't find many examples online.
The data could be just one or two columns coming from an awk script or a CSV file. Let's say I want an approximate group-by on one of the columns. I would use a database GROUP BY, but the number of rows is over 6-7 billion. I would like the first approximate result in under 3 to 4 seconds, then run Bayes or something similar after decisions are made on the prior. Any ideas on a really rough initial group count?
If you can provide an example of the algorithm in Python or Java, that would be very helpful.
@Ben Allison's answer is a good way to count the total lines. Since you mentioned Bayes and the prior, I will add an answer in that direction to estimate the percentage of the different groups (see my comments on your question: if you have an idea of the total and you want to do a group-by, estimating the percentage of each group makes more sense).
I will start by assuming you have only two groups, group1 and group2 (extensions can be made for multiple groups; see the explanation later).
If m of the first n lines (rows) you have processed are group1, denote this event as M(m,n). Obviously the other n-m lines are group2, since we assume those are the only two possible groups. So the conditional probability of the event M(m,n), given the percentage of group1 (call it s), is given by the binomial distribution with n trials. We are trying to estimate s in a Bayesian way.
The conjugate prior for the binomial is the beta distribution. So for simplicity, we choose Beta(1,1) as the prior (of course, you can pick your own parameters here for alpha and beta), which is the uniform distribution on (0,1). Therefore, for this beta distribution, alpha=1 and beta=1.
The recursive update formulas for a binomial likelihood with a beta prior are as below:

if group == 'group1':
    alpha = alpha + 1
else:
    beta = beta + 1
The posterior of s is actually also a beta distribution:

                    s^(m+alpha-1) * (1-s)^(n-m+beta-1)
    p(s | M(m,n)) = ---------------------------------- = Beta(m+alpha, n-m+beta)
                           B(m+alpha, n-m+beta)
where B is the beta function. To report the estimated result, you can rely on the beta distribution's mean and variance, where:

mean = alpha/(alpha+beta)
var = alpha*beta/((alpha+beta)**2 * (alpha+beta+1))
So a few lines of Python (call it groupby.py) to process your data from stdin and estimate the percentage of group1 would look something like this:
import sys

# Beta(1, 1) prior: alpha and beta both start at 1 (uniform on (0, 1))
alpha = 1.
beta = 1.

for line in sys.stdin:
    data = line.strip()
    if data == 'group1':
        alpha += 1.
    elif data == 'group2':
        beta += 1.
    else:
        continue
    # posterior mean and variance of Beta(alpha, beta)
    mean = alpha / (alpha + beta)
    var = alpha * beta / ((alpha + beta)**2 * (alpha + beta + 1))
    print('mean = %.3f, var = %.3f' % (mean, var))
I fed a few lines of data to the script:
group1
group1
group1
group1
group2
group2
group2
group1
group1
group1
group2
group1
group1
group1
group2
And here is what I get as results:
mean = 0.667, var = 0.056
mean = 0.750, var = 0.037
mean = 0.800, var = 0.027
mean = 0.833, var = 0.020
mean = 0.714, var = 0.026
mean = 0.625, var = 0.026
mean = 0.556, var = 0.025
mean = 0.600, var = 0.022
mean = 0.636, var = 0.019
mean = 0.667, var = 0.017
mean = 0.615, var = 0.017
mean = 0.643, var = 0.015
mean = 0.667, var = 0.014
mean = 0.688, var = 0.013
mean = 0.647, var = 0.013
The result shows that, after the 15th row is processed, group1 is estimated to account for 64.7% of the rows (based on our Beta(1,1) prior). You might also notice that the variance keeps shrinking as more and more observations accumulate.
Now if you have more than two groups, just change the underlying distribution from binomial to multinomial; the corresponding conjugate prior is then the Dirichlet distribution, and everything else changes analogously. A rough sketch of that extension follows.
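For illustration only (this sketch is mine, not part of the original answer; the symmetric Dirichlet(1, ..., 1) prior and the idea of discovering group names on the fly are assumptions), the multi-group version is just a dictionary of pseudo-counts, and each group's posterior mean share is its pseudo-count divided by the total:

import sys
from collections import defaultdict

# symmetric Dirichlet prior: every group starts with pseudo-count 1
PRIOR = 1.0
counts = defaultdict(lambda: PRIOR)

for line in sys.stdin:
    group = line.strip()
    if group:
        counts[group] += 1.0

total = sum(counts.values())
for group, c in sorted(counts.items()):
    # posterior mean of this group's share under the Dirichlet posterior
    print('%s: mean share = %.3f' % (group, c / total))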
You said you would like the approximate estimate in 3-4 seconds. In this case, you can just sample a portion of your data and feed it to the above script, e.g.,
head -n100000 YOURDATA.txt | python groupby.py
That's it. Hope it helps.
If it's reasonable to assume the data are IID (so there's no bias, such as certain types of records occurring only in certain parts of the stream), then just subsample and scale the counts up by the ratio of the full size to the sample size.
Take say the first million records (this should be processable in a couple of seconds). Its size is x units (MB, chars, whatever you care about). The full stream has size y, where y >> x. Now, derive counts for whatever you care about from your sample x, and simply scale them by the factor y/x for approximate full counts. An example: you want to know roughly how many records have column 1 with value v in the full stream. The first million records have a file size of 100MB, while the total file size is 10GB. In the first million records, 150,000 of them have value v for column 1. So, you assume that in the full 10GB of records, you'll see 150,000 * (10,000,000,000 / 100,000,000) = 15,000,000 with that value. Any statistics you compute on the sample can simply be scaled by the same factor to produce an estimate.
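A minimal sketch of that scale-up (my own illustration, not from the answer; the sample size, the comma delimiter, the target value v, and the 10GB total size are all assumed for the example):

import sys

SAMPLE_ROWS = 1000000          # assumed: size of the prefix sample
FULL_BYTES = 10 * 10**9        # assumed: total stream size y (10GB)
TARGET = 'v'                   # assumed: value of interest in column 1

sample_bytes = 0
matches = 0
for i, line in enumerate(sys.stdin):
    if i >= SAMPLE_ROWS:
        break
    sample_bytes += len(line)
    # assumed comma-separated input; count rows whose first column equals TARGET
    if line.split(',')[0] == TARGET:
        matches += 1

# scale the sample count by the size ratio y/x to approximate the full count
estimate = matches * (FULL_BYTES / float(sample_bytes))
print('approximate rows with column 1 == %s: %d' % (TARGET, int(estimate)))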
If there is bias in the data such that certain records are more or less likely to be in certain places of the file then you should select your sample records at random (or evenly spaced intervals) from the total set. This is going to ensure an unbiased, representative sample, but probably incur a much greater I/O overhead.
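If you need that unbiased sample from a stream whose total length you don't know up front, one standard way to do the random selection (my suggestion, not something the answer names; the sample size is an assumption) is reservoir sampling, which keeps a uniform random sample in a single pass:

import random
import sys

K = 100000  # assumed sample size
reservoir = []

for i, line in enumerate(sys.stdin):
    if i < K:
        reservoir.append(line)
    else:
        # replace a kept line with probability K/(i+1), so every line
        # ends up in the sample with equal probability
        j = random.randint(0, i)
        if j < K:
            reservoir[j] = line

# 'reservoir' is now a uniform random sample of the stream;
# pipe it into groupby.py or scale counts from it as described above.
sys.stdout.writelines(reservoir)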