I've got a piece of aggregation code that works well enough but runs a bit slowly against a data frame with 10e6 rows. I'm not that experienced in R, so apologies for my cringe-worthy code!
I just want to do a basic roll-up and sum of values for a common key...
e.g. go from...
key val
1 a 5
2 b 7
3 a 6
to...
key val
1 a 11
2 b 7
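To make that concrete, here's the toy input as a reproducible data frame (inp is just the name my code below uses):
inp <- data.frame(key = c("a", "b", "a"), val = c(5, 7, 6))  # desired result: key a -> 11, b -> 7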
The best I can manage is...
keys = unique(inp$key)
vals = sapply(keys, function(x) { sum(inp[inp$key==x,]$val) })
out = data.frame(key=keys, val=vals)
I have a gut feeling that the inp[inp$key==x,] part is not the best way. Is there an obvious speed-up I'm missing? I can do it in Hadoop (since the 10e6 dataset is actually already a roll-up from a 2e9-row dataset), but I'm trying to improve my R.
Cheers, Mat
The process involves two stages. First, collate the individual cases of raw data together with a grouping variable. Second, perform whichever calculation you want on each group of cases.
Using sapply and split is another option. I'll extend via the data and benchmarks from @Chase's excellent answer.
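On the question's toy data, the two stages look like this (a minimal sketch of what fn.split below does):
inp <- data.frame(key = c("a", "b", "a"), val = c(5, 7, 6))
groups <- split(inp$val, inp$key)  # stage 1: group val by key -> list(a = c(5, 6), b = 7)
sapply(groups, sum)                # stage 2: sum each group  -> named vector a = 11, b = 7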
fn.tapply <- function(daters) with(daters, tapply(val, key, FUN = sum))
fn.split <- function(daters) with(daters, sapply(split(val, key), sum))
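For reference, a data frame matching the str() output below can be built roughly like this (an approximation only; @Chase's exact construction isn't reproduced here, so the val values won't match the listing):
set.seed(21)
dat <- data.frame(key = factor(rep(letters[1:5], each = 2e5)),  # 5 keys, 1e6 rows total
                  val = runif(1e6))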
str(dat)
# 'data.frame': 1000000 obs. of 2 variables:
# $ key: Factor w/ 5 levels "a","b","c","d",..: 1 1 1 1 1 1 1 1 1 1 ...
# $ val: num 0.186 0.875 0.42 0.294 0.878 ...
library(rbenchmark)  # benchmark() comes from the rbenchmark package
benchmark(fn.tapply(dat), fn.split(dat)
, columns = c("test", "elapsed", "relative")
, order = "relative"
, replications = 100
)
# test elapsed relative
# 2 fn.split(dat) 4.106 1.00000
# 1 fn.tapply(dat) 69.982 17.04384
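Note that both functions return a named vector rather than a data frame; if you want the two-column shape from the question back, it's a one-liner (a small follow-up sketch):
res <- fn.split(dat)                                     # named numeric: one summed value per key
out <- data.frame(key = names(res), val = unname(res))   # back to key/val columns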