Efficiently compute mean and standard deviation from a frequency table

Suppose I have the following frequency table.

> print(dat)
  V1    V2
1  1 11613
2  2  6517
3  3  2442
4  4   687
5  5   159
6  6    29

# V1 = Score
# V2 = Frequency

How can I efficiently compute the mean and standard deviation? The expected results are mean = 1.66 and SD = 0.87. Replicating each score by its frequency takes too long to compute.

asked May 01 '12 12:05

neversaint

1 Answer

The mean is easy. The SD is a little trickier: you can't just use fastmean() again, because there's an n-1 in the denominator.

> dat <- data.frame(freq=seq(6),value=runif(6)*100)
> fastmean <- function(dat) {
+   with(dat, sum(freq*value)/sum(freq) )
+ }
> fastmean(dat)
[1] 55.78302
> 
> fastRMSE <- function(dat) {
+   mu <- fastmean(dat)
+   with(dat, sqrt(sum(freq*(value-mu)^2)/(sum(freq)-1) ) )
+ }
> fastRMSE(dat)
[1] 34.9316
> 
> # To test
> expanded <- with(dat, rep(value,freq) )
> mean(expanded)
[1] 55.78302
> sd(expanded)
[1] 34.9316
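
The same arithmetic can be applied directly to the table from the question. This is a minimal sketch, assuming the columns are named V1 (score) and V2 (frequency) as printed there; the names tab, n, mu and s are illustrative only.

```r
# Frequency table from the question: V1 = score, V2 = frequency
tab <- data.frame(V1 = 1:6, V2 = c(11613, 6517, 2442, 687, 159, 29))

n  <- sum(tab$V2)                                     # total number of observations
mu <- sum(tab$V1 * tab$V2) / n                        # frequency-weighted mean
s  <- sqrt(sum(tab$V2 * (tab$V1 - mu)^2) / (n - 1))   # sample SD, n - 1 denominator

round(c(mean = mu, sd = s), 2)
# mean 1.66, sd 0.87 (the values quoted in the question)
```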

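Note that fastRMSE, despite its name, returns the sample standard deviation (it matches sd() above). It also ends up calculating sum(freq) twice, once inside fastmean() and once for the denominator. Eliminating this would probably result in another minor speed boost.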
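
One possible way to avoid the repeated sum(freq) is to compute both statistics in a single function. The sketch below is not part of the original answer: freq_stats and the toy data are hypothetical, and it has not been benchmarked, so any speed gain is an assumption.

```r
# Hypothetical helper: returns mean and sample SD, computing sum(freq) once
freq_stats <- function(dat) {
  n  <- sum(dat$freq)                                   # total count, reused below
  mu <- sum(dat$freq * dat$value) / n                   # weighted mean
  s  <- sqrt(sum(dat$freq * (dat$value - mu)^2) / (n - 1))
  c(mean = mu, sd = s)
}

# Check against the replicate-based approach on small made-up data
toy      <- data.frame(freq = c(3, 1, 2), value = c(10, 20, 40))
expanded <- rep(toy$value, toy$freq)
isTRUE(all.equal(unname(freq_stats(toy)), c(mean(expanded), sd(expanded))))
```

The final line should evaluate to TRUE if the two approaches agree.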

Benchmarking

> library(microbenchmark)
> microbenchmark(
+   fastmean(dat),
+   mean( with(dat, rep(value,freq) ) )
+   )
Unit: microseconds
                               expr    min      lq median     uq    max
1                     fastmean(dat) 12.433 13.5335 14.776 15.398 23.921
2 mean(with(dat, rep(value, freq))) 21.225 22.3990 22.714 23.406 86.434
> dat <- data.frame(freq=seq(60),value=runif(60)*100)
> microbenchmark(
+   fastmean(dat),
+   mean( with(dat, rep(value,freq) ) )
+   )
Unit: microseconds
                               expr    min     lq  median      uq     max
1                     fastmean(dat) 13.177 14.544 15.8860 17.2905  54.983
2 mean(with(dat, rep(value, freq))) 42.610 48.659 49.8615 50.6385 151.053
> dat <- data.frame(freq=seq(600),value=runif(600)*100)
> microbenchmark(
+   fastmean(dat),
+   mean( with(dat, rep(value,freq) ) )
+   )
Unit: microseconds
                               expr      min       lq    median       uq       max
1                     fastmean(dat)   15.706   17.489   25.8825   29.615    79.113
2 mean(with(dat, rep(value, freq))) 1827.146 2283.551 2534.7210 2884.933 26196.923

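The replicating solution appears to be O(N^2) in the number of table rows. That follows from how the test data is built: with freq=seq(N), the expanded vector has N(N+1)/2 elements. In general, its cost grows with the total count, sum(freq).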
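
The growth can be checked without timing anything by counting how many values rep() has to create for each benchmark size. A small sketch, using the same freq=seq(N) construction as the benchmarks; the variable names are illustrative.

```r
# Number of table rows used in the three benchmarks
rows <- c(6, 60, 600)

# Length of the expanded vector for each size, with freq = seq(N)
sizes <- sapply(rows, function(N) length(rep(runif(N), seq(N))))

data.frame(rows = rows, expanded_length = sizes)
# expanded_length equals N * (N + 1) / 2, i.e. 21, 1830 and 180300
```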

[Figure: Replicating solution]

The fastmean solution appears to have a fixed cost of 12 microseconds or so, after which it scales beautifully.

More benchmarking

Comparison with dot product.

dat <- data.frame(freq=seq(600),value=runif(600)*100)
dbaupp <- function(dat) {
  total.count <- sum(dat$freq)
  as.vector(dat$freq %*% dat$value) / total.count
}
microbenchmark(
  fastmean(dat),
  mean( with(dat, rep(value,freq) ) ),
  dbaupp(dat)
)

Unit: microseconds
                               expr     min       lq   median       uq       max
1                       dbaupp(dat)  20.162  21.6875  25.6010  31.3475   104.054
2                     fastmean(dat)  14.680  16.7885  20.7490  25.1765    94.423
3 mean(with(dat, rep(value, freq))) 489.434 503.6310 514.3525 583.2790 30130.302
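
Base R's stats package also provides weighted.mean(), which computes the same frequency-weighted mean. It was not part of the timings above, so no speed claim is made for it here. The sketch below only checks that the three formulations agree; the seed and the names m_sum, m_dot and m_wm are arbitrary choices for illustration.

```r
set.seed(1)  # arbitrary seed so the check is repeatable
dat <- data.frame(freq = seq(600), value = runif(600) * 100)

m_sum <- with(dat, sum(freq * value) / sum(freq))            # fastmean-style
m_dot <- as.vector(dat$freq %*% dat$value) / sum(dat$freq)   # dot product
m_wm  <- weighted.mean(dat$value, dat$freq)                  # stats::weighted.mean

isTRUE(all.equal(m_sum, m_dot)) && isTRUE(all.equal(m_sum, m_wm))
```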

answered Nov 16 '22 00:11

Ari B. Friedman

Tags: r, statistics, mean
