Suppose I have the following frequency table.
> print(dat)
V1 V2
1 1 11613
2 2 6517
3 3 2442
4 4 687
5 5 159
6 6 29
# V1 = Score
# V2 = Frequency
How can I efficiently compute the mean and standard deviation? The expected results are MEAN=1.66 and SD=0.87. Replicating each score by its frequency takes too long to compute.
The mean is the sum of the products of the midpoints and frequencies divided by the total of the frequencies: $\mu = \frac{\sum f \cdot M}{\sum f}$. The sample variance is $S^2 = \frac{\sum f \cdot M^2 - n\mu^2}{n - 1}$, where $n = \sum f$; the standard deviation is its square root, $S = \sqrt{S^2}$.
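As a rough illustration, the shortcut formula can be applied directly to the table from the question. This is only a sketch: the data frame is retyped by hand from the printed output, and the variable names n, mu and s2 are chosen here for readability.

```r
# Frequency table from the question: V1 = score, V2 = frequency
dat <- data.frame(V1 = 1:6, V2 = c(11613, 6517, 2442, 687, 159, 29))

n  <- sum(dat$V2)                      # total number of observations
mu <- sum(dat$V2 * dat$V1) / n         # frequency-weighted mean
# Shortcut form of the sample variance: (sum(f * M^2) - n * mu^2) / (n - 1)
s2 <- (sum(dat$V2 * dat$V1^2) - n * mu^2) / (n - 1)

round(c(mean = mu, sd = sqrt(s2)), 2)  # about 1.66 and 0.87, as in the question
```

The shortcut form subtracts two large numbers, so it can lose precision when the mean is large relative to the spread. Summing freq * (value - mu)^2 instead is usually the safer choice.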
The mean from a frequency table is the average of a data set that has been organised into a frequency table. To calculate it, find the total of the values (each value multiplied by its frequency) and divide by the number of values, which is the total frequency.
Mean is easy. SD is a little trickier (you can't just use fastmean() again because there's an n-1 in the denominator).
> dat <- data.frame(freq=seq(6),value=runif(6)*100)
> fastmean <- function(dat) {
+ with(dat, sum(freq*value)/sum(freq) )
+ }
> fastmean(dat)
[1] 55.78302
>
> fastRMSE <- function(dat) {
+ mu <- fastmean(dat)
+ with(dat, sqrt(sum(freq*(value-mu)^2)/(sum(freq)-1) ) )
+ }
> fastRMSE(dat)
[1] 34.9316
>
> # To test
> expanded <- with(dat, rep(value,freq) )
> mean(expanded)
[1] 55.78302
> sd(expanded)
[1] 34.9316
Note that fastRMSE calculates sum(freq) twice. Eliminating this would probably result in another minor speed boost.
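One way to avoid the repeated sum(freq) might look like the sketch below. The name fastSD is hypothetical (it is not part of the original answer), and the seed is arbitrary, set only so the check is reproducible.

```r
# Hypothetical variant of fastRMSE that computes sum(freq) only once
fastSD <- function(dat) {
  with(dat, {
    n  <- sum(freq)                 # total count, reused below
    mu <- sum(freq * value) / n     # frequency-weighted mean
    sqrt(sum(freq * (value - mu)^2) / (n - 1))
  })
}

set.seed(1)  # arbitrary seed, only for reproducibility
dat <- data.frame(freq = seq(6), value = runif(6) * 100)

# Compare against the slow, expanded calculation
all.equal(fastSD(dat), sd(with(dat, rep(value, freq))))
```

Whether this is measurably faster would have to be checked with microbenchmark. The saving is a single pass over freq, so it is likely to be small.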
Benchmarking
> microbenchmark(
+ fastmean(dat),
+ mean( with(dat, rep(value,freq) ) )
+ )
Unit: microseconds
expr min lq median uq max
1 fastmean(dat) 12.433 13.5335 14.776 15.398 23.921
2 mean(with(dat, rep(value, freq))) 21.225 22.3990 22.714 23.406 86.434
> dat <- data.frame(freq=seq(60),value=runif(60)*100)
> microbenchmark(
+ fastmean(dat),
+ mean( with(dat, rep(value,freq) ) )
+ )
Unit: microseconds
expr min lq median uq max
1 fastmean(dat) 13.177 14.544 15.8860 17.2905 54.983
2 mean(with(dat, rep(value, freq))) 42.610 48.659 49.8615 50.6385 151.053
> dat <- data.frame(freq=seq(600),value=runif(600)*100)
> microbenchmark(
+ fastmean(dat),
+ mean( with(dat, rep(value,freq) ) )
+ )
Unit: microseconds
expr min lq median uq max
1 fastmean(dat) 15.706 17.489 25.8825 29.615 79.113
2 mean(with(dat, rep(value, freq))) 1827.146 2283.551 2534.7210 2884.933 26196.923
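The replicating solution appears to be O( N^2 ) in the number of entries. That is expected for this benchmark: with freq=seq(N), the expanded vector has N(N+1)/2 elements. In general its cost grows with the total frequency.
The fastmean solution appears to have a fixed cost of 12 microseconds or so, after which it scales beautifully.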
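The growth of the replicating approach can be seen by counting how long the expanded vector gets for the benchmark sizes used above. This is a small sketch; expanded_length is a helper name introduced here for illustration only.

```r
# Length of the expanded vector when freq = seq(N), as in the benchmarks
expanded_length <- function(N) length(rep(runif(N), seq(N)))

sapply(c(6, 60, 600), expanded_length)   # 21 1830 180300

N <- 600
N * (N + 1) / 2                          # 180300, matching the last entry
```

The weighted-sum approach only ever touches N values, which is consistent with its nearly flat timings.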
More benchmarking
Comparison with dot product.
dat <- data.frame(freq=seq(600),value=runif(600)*100)
dbaupp <- function(dat) {
total.count <- sum(dat$freq)
as.vector(dat$freq %*% dat$value) / total.count
}
microbenchmark(
fastmean(dat),
mean( with(dat, rep(value,freq) ) ),
dbaupp(dat)
)
Unit: microseconds
expr min lq median uq max
1 dbaupp(dat) 20.162 21.6875 25.6010 31.3475 104.054
2 fastmean(dat) 14.680 16.7885 20.7490 25.1765 94.423
3 mean(with(dat, rep(value, freq))) 489.434 503.6310 514.3525 583.2790 30130.302