Aggregating Data in R with user defined function

Tags: r, aggregate

I have grouped data in R using the aggregate method.

Avg=aggregate(x$a, by=list(x$b,x$c),FUN= mean)

This gives me the mean for all the values of 'a' grouped by 'b' and 'c' of data frame 'x'.

Now, instead of averaging all values of 'a', I want to average only the 5 largest values of 'a' within each group of 'b' and 'c'.

Sample data set

a    b    c
10   G    3 
20   G    3 
22   G    3
10   G    3 
15   G    3
25   G    3
30   G    3

The aggregate call above gives me

Group.1    Group.2    x
  G          3       18.86

But I want the average of just the 5 largest values of 'a':

Group.1    Group.2    x
  G          3       22.40

I am not able to fit the top-values code below, which I am currently using, into the aggregate function:

index <- order(vector, decreasing = TRUE)[1:5]
vector[index]

Can anyone please shed some light on how to do this?

user3812709 asked Aug 21 '14 16:08


1 Answer

You can order the data, get the top 5 entries (using head) and then apply the mean:

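# v is the vector of 'a' values for one group (named v so it does not shadow the data frame x)
aggregate(x$a, by = list(x$b, x$c), FUN = function(v) mean(head(v[order(-v)], 5)))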
#  Group.1 Group.2    x
#1       G       3 22.4
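
For a reproducible check, the sample data from the question can be rebuilt by hand. The data.frame() construction below is an assumption, since the question only shows the printed table, and the names avg_all and avg_top5 are made up for illustration. This sketch uses sort(decreasing = TRUE) in place of indexing with order(); the two give the same result here because the data contain no NA values.

```r
# Sample data from the question, rebuilt by hand (assumed construction).
x <- data.frame(
  a = c(10, 20, 22, 10, 15, 25, 30),
  b = "G",
  c = 3
)

# Plain mean of all seven values in the single (G, 3) group: 132 / 7.
avg_all <- aggregate(x$a, by = list(x$b, x$c), FUN = mean)

# Mean of the five largest values: sort descending, keep the first five.
avg_top5 <- aggregate(
  x$a,
  by = list(x$b, x$c),
  FUN = function(v) mean(head(sort(v, decreasing = TRUE), 5))
)

print(avg_all)   # x column is about 18.85714
print(avg_top5)  # x column is 22.4, i.e. (30 + 25 + 22 + 20 + 15) / 5
```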

If you want to do this with a custom function, I would do it like this:

myfunc <- function(vec, n){
  mean(head(vec[order(-vec)], n))
}

aggregate(x$a, by=list(x$b,x$c),FUN= function(z) myfunc(z, 5))
#  Group.1 Group.2    x
#1       G       3 22.4

I actually prefer using the formula style in aggregate which would look like this (I also use with() to be able to refer to the column names directly without using x$ each time):

with(x, aggregate(a ~ b + c, FUN= function(z) myfunc(z, 5)))
#  b c    a
#1 G 3 22.4
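
As a possible variation on the formula style, aggregate also accepts the data frame through its data argument, and arguments given after FUN are forwarded to FUN, so n can be passed by name without an anonymous wrapper. The sketch below rebuilds the sample data by hand (an assumption) and compares both spellings; res1 and res2 are illustrative names.

```r
# Sample data from the question, rebuilt by hand (assumed construction).
x <- data.frame(a = c(10, 20, 22, 10, 15, 25, 30), b = "G", c = 3)

# Same helper as in the answer: mean of the n largest values of vec.
myfunc <- function(vec, n) {
  mean(head(vec[order(-vec)], n))
}

# Formula interface with data = x instead of wrapping the call in with(x, ...).
res1 <- aggregate(a ~ b + c, data = x, FUN = function(z) myfunc(z, 5))

# Extra arguments after FUN are passed on to FUN, so n = 5 reaches myfunc.
res2 <- aggregate(a ~ b + c, data = x, FUN = myfunc, n = 5)

print(res1)
print(res2)
```

Both calls should print the same one-row result with a equal to 22.4; which form reads better is largely a matter of taste.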

Here, aggregate calls the function once per group, and the parameter z receives the vector of a values for each combination of b and c. Does that make more sense now? Also note that the result is not an integer but a numeric (decimal) value, 22.4 in this case.
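
To see the per-group behaviour more clearly, here is a small sketch with a second group added. The data frame x2 and its group "H" are hypothetical, made up purely for illustration. Group "H" has only three values, so head() returns all three and the mean is taken over those.

```r
# Hypothetical two-group data, made up for illustration only.
x2 <- data.frame(
  a = c(10, 20, 22, 10, 15, 25, 30, 1, 2, 3),
  b = c(rep("G", 7), rep("H", 3)),
  c = 3
)

# z receives the 'a' values of one (b, c) group at a time.
res <- aggregate(
  a ~ b + c,
  data = x2,
  FUN = function(z) mean(head(z[order(-z)], 5))
)

print(res)               # one row per group: G and H
print(is.numeric(res$a)) # TRUE
print(is.integer(res$a)) # FALSE: mean() returns a double
```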

talat answered Nov 03 '22 12:11