
Aggregating values in a data frame based on key

Tags: idioms, r

I've got a piece of aggregation code that works well enough but runs a bit slowly against a data frame with 10e6 rows. I'm not that experienced in R, so apologies for my cringe-worthy code!

I just want to do a basic roll up and sum of values for a common key...

e.g. go from...

  key val
1   a   5
2   b   7
3   a   6

to...

  key val
1   a  11
2   b   7

The best I can manage is...

keys = unique(inp$key)
vals = sapply(keys, function(x) { sum(inp[inp$key==x,]$val) })
out = data.frame(key=keys, val=vals)

I have a gut feeling that the inp[inp$key==x,] subsetting is not the best way. Is there an obvious speed-up I'm missing? I can do it in Hadoop (since the 10e6 dataset is actually already a rollup from a 2e9 row dataset) but I'm trying to improve my R.

Cheers, Mat

mat kelcey, asked Jul 25 '11 05:07



1 Answer

Using sapply and split is another option. I'll extend via the data and benchmarks from @Chase's excellent answer.
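As a minimal sketch of the idea on the three-row example from the question: split() groups the values by key, sapply() sums each group, and the resulting named vector is turned back into a data frame. The name `inp` comes from the question; `sums` and `out` are illustrative names chosen here.

```r
# Three-row example from the question
inp <- data.frame(key = c("a", "b", "a"), val = c(5, 7, 6))

# split() returns a list of val vectors, one per key; sapply() sums each one
sums <- sapply(split(inp$val, inp$key), sum)

# Convert the named vector back into a two-column data frame
out <- data.frame(key = names(sums), val = unname(sums))
print(out)
```

Unlike the sapply over unique(inp$key) in the question, this groups the values in a single split() call instead of subsetting the whole data frame once per key.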

fn.tapply <- function(daters) with(daters, tapply(val, key, FUN = sum))
fn.split <- function(daters) with(daters, sapply(split(val, key), sum))

str(dat)
# 'data.frame': 1000000 obs. of  2 variables:
#  $ key: Factor w/ 5 levels "a","b","c","d",..: 1 1 1 1 1 1 1 1 1 1 ...
#  $ val: num  0.186 0.875 0.42 0.294 0.878 ...

library(rbenchmark)  # provides benchmark()
benchmark(fn.tapply(dat), fn.split(dat)
          , columns = c("test", "elapsed", "relative")
          , order = "relative"
          , replications = 100
          )
#             test elapsed relative
# 2  fn.split(dat)   4.106  1.00000
# 1 fn.tapply(dat)  69.982 17.04384
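The `dat` object comes from @Chase's answer, which is not reproduced here. To try the comparison locally, a comparable data frame could be built roughly as follows. The seed, the use of runif(), and the sorted keys are assumptions chosen only to resemble the str() output above, not the original construction, so timings will differ.

```r
set.seed(1)  # arbitrary seed, assumed only for reproducibility
n <- 1e6

# Assumed stand-in for dat: 5 sorted factor levels and uniform values
dat <- data.frame(
  key = factor(rep(letters[1:5], each = n / 5)),
  val = runif(n)
)

# Repeated from the answer so this block runs on its own
fn.tapply <- function(daters) with(daters, tapply(val, key, FUN = sum))
fn.split  <- function(daters) with(daters, sapply(split(val, key), sum))

# Both approaches should agree on the per-key sums
print(all.equal(as.vector(fn.tapply(dat)), as.vector(fn.split(dat))))
```

Checking that the two functions agree before timing them helps confirm the benchmark compares equivalent work.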
Joshua Ulrich, answered Oct 15 '22 08:10