I need to calculate weighted means per row (6M+ rows), but it takes a very long time. The column holding the weights is a character field, so weighted.mean() can't be applied directly.
Background data:
library(data.table)
library(stringr)
values <- c(1,2,3,4)
grp <- c("a", "a", "b", "b")
weights <- c("{10,0,0,0}", "{0,10,0,0}", "{10,10,0,0}", "{0,0,10,0}")
DF <- data.frame(grp, weights, stringsAsFactors = FALSE)  # keep weights as character, not factor
DT <- data.table(DF)
string.weighted.mean <- function(weights.x) {
  # Split on any run of non-digits; the leading "{" produces an empty
  # string, which as.numeric() turns into NA, hence the na.omit().
  w <- na.omit(as.numeric(unlist(str_split(string = weights.x, pattern = "[^0-9]+"))))
  weighted.mean(x = values, w = w)
}
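For reference, a quick sanity check on one string (with values <- c(1,2,3,4) from above):

string.weighted.mean("{10,10,0,0}")
# [1] 1.5   # (1*10 + 2*10) / (10 + 10)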
Here is how it can be done (too slow) with data.frames:
DF$wm <- mapply(string.weighted.mean, DF$weights)
This does the job but is way too slow (hours):
DT[, wm:=mapply(string.weighted.mean, weights)]
How can the last line be rephrased to speed things up?
DT[, rowid := 1:nrow(DT)]
setkey(DT, rowid)
DT[, wm := {
  weighted.mean(x = values,
                w = na.omit(as.numeric(unlist(str_split(string = weights, pattern = "[^0-9]+")))))
}, by = rowid]
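If the number of distinct weight strings is small relative to the row count, a grouping trick (a sketch of my own, not part of the original answers) can help: group by the weight string itself, so each unique string is parsed only once and the result is recycled across all matching rows.

# Sketch: assumes 'values' from the question. Within each group,
# weights[1] is the (single) unique weight string for that group.
DT[, wm := weighted.mean(
       x = values,
       w = na.omit(as.numeric(unlist(str_split(string = weights[1], pattern = "[^0-9]+"))))
     ),
   by = weights]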
Since the grp column doesn't appear to have anything to do with the computation of the weighted mean, I tried to simplify the problem a bit.
values <- seq(4)
# A function to build a weight string of four random weights, each 0 or 10
tstwts <- function()
{
w <- sample( c(0, 10), 4, replace = TRUE )
paste0( "{", paste(w, collapse = ","), "}" )
}
# Generate 100K strings and put them into a vector
u <- replicate( 1e5, tstwts() )
head(u) # Check
table(u)
# Function to compute a weighted mean from a weight string, using the
# external numeric vector 'values' (assumed to have the same length as
# the weights)
f <- function(x)
{
valstr <- gsub( "[\\{\\}]", "", x )
wts <- as.numeric( unlist( strsplit(valstr, ",") ) )
sum(wts * values) / sum(wts)
}
# Apply the function f to each element of the weight vector u
v <- sapply(u, f)
# Some checks:
head(v)
table(v)
On my system, for 100K repetitions,
> system.time(sapply(u, f))
   user  system elapsed
   3.79    0.00    3.83
A data.table version of this (sans groups) would be
DT <- data.table( weights = u )
DT[, wt.mean := lapply(weights, f)]
head(DT)
dim(DT)
On my system, this takes
system.time( DT[, wt.mean := lapply( weights, f )] )
   user  system elapsed
   3.62    0.03    3.69
so expect about 35-40 s per million observations on a system comparable to mine (Win7, 2.8GHz dual core chip, 8GB RAM). YMMV.
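Most of that time is spent calling f once per element. A vectorized sketch (my assumption, not benchmarked in the original answer): strsplit() accepts a whole character vector, so the parsing can be done in one pass and the weighted means computed with matrix arithmetic.

# f_vec is a hypothetical vectorized counterpart of f; it assumes every
# string holds exactly length(values) weights.
f_vec <- function(x)
{
  wts <- strsplit( gsub("[{}]", "", x), "," )                       # split all strings at once
  m   <- matrix( as.numeric(unlist(wts)),
                 ncol = length(values), byrow = TRUE )              # one row of weights per string
  as.vector(m %*% values) / rowSums(m)                              # sum(w*v) / sum(w) per row
}
DT[, wt.mean2 := f_vec(weights)]

This replaces 100K function calls with a handful of vectorized operations, at the cost of materializing an n-by-4 weight matrix.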