
R - aggregate data for 1 column by another column, based on statistics on a 3rd column

Tags: r, aggregate

Let's say I have an R data frame with 3 columns A, B and C, where the values of A are not all distinct.

How do I get, for each value of A, the value of C for which B is minimum (for that value of A)? Something like this in pseudo SQL code: SELECT C WHERE B = MIN(B) GROUP BY A?

I have looked at the aggregate() function but I am not sure it can get it done.

aggregate(B ~ A, data = mydataframe, min) only gives me the min of B for each A, but then I do not know how to get the corresponding C value.

Is there a way to subset the data frame with the result of this aggregation in order to get the C values, and/or can it be done in a single call to aggregate()?

Thanks

An example of what I would like to get:

input:

A   B   C
1   0   1
1   2   2
1   1   3
1   1   4
2   1   1
2   2   2
2   0   3
2   3   4

output:

1
3

1 is the value of C corresponding to the minimum of B (0) for A = 1

3 is the value of C corresponding to the minimum of B (0) for A = 2

Jeanpierre Nenuphar asked Feb 19 '14 12:02




2 Answers

You can use the data.table package:

library(data.table)
DT <- as.data.table(mydataframe)

DT[ , C[which.min(B)], by = "A"]
#    A V1
# 1: 1  1
# 2: 2  3

Or dplyr:

library(dplyr)
mydataframe %>%
  group_by(A) %>%
  summarise(res = C[which.min(B)])
#   A res
# 1 2   3
# 2 1   1

Or the base function by:

by(mydataframe, mydataframe$A, function(x) x$C[which.min(x$B)])
# mydataframe$A: 1
# [1] 1
# -------------------------------------------------------------------------------
# mydataframe$A: 2
# [1] 3
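The question also asks whether the result of aggregate() can be used to subset the data frame. A minimal sketch of one way to do that, assuming the example data from the question is stored in mydataframe: compute the per-group minima, then merge them back onto the original rows. Note that if several rows tie for the minimum within a group, this returns all of them.

```r
# Example data from the question
mydataframe <- data.frame(
  A = c(1, 1, 1, 1, 2, 2, 2, 2),
  B = c(0, 2, 1, 1, 1, 2, 0, 3),
  C = c(1, 2, 3, 4, 1, 2, 3, 4)
)

# Minimum of B for each value of A
mins <- aggregate(B ~ A, data = mydataframe, FUN = min)

# merge() joins on the shared columns A and B,
# so only rows holding a group minimum are kept
res <- merge(mins, mydataframe)
res$C
```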
Sven Hohenstein answered Oct 20 '22 14:10


1) SQLite guarantees that when a query uses min or max, the other selected columns are taken from the same row as that minimum or maximum, so we get a particularly simple solution (DF is the data frame):

library(sqldf)

# one minimum per group
sqldf("select A, min(B) B, C from DF group by A")

If there can be duplicated minima and we want all of them, then this select, which uses a correlated subquery, works:

# all minima per group
sqldf("select * from DF x 
      where x.b = (select min(y.b) from DF y where y.a = x.a)")

2) Using ave from base R we can do this:

# one minimum per group
subset(DF, !! ave(B, A, FUN = function(x) seq_along(x) == which.min(x)))

# all minima per group
subset(DF, !! ave(B, A, FUN = function(x) x == min(x)))
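To see how the two ave() calls differ, here is a small sketch on a hypothetical data frame (not the question's data) in which both groups have tied minima. ave() returns a vector as long as B, with the 0/1 result of FUN placed back at each row's position. The sketch uses as.logical() in place of !! for the conversion, which should be equivalent here.

```r
# Hypothetical data with tied minima in each group
DF <- data.frame(
  A = c(1, 1, 1, 2, 2),
  B = c(0, 0, 1, 3, 3),
  C = c(10, 20, 30, 40, 50)
)

# Flag only the first minimum within each group of A
first_min <- as.logical(
  ave(DF$B, DF$A, FUN = function(x) seq_along(x) == which.min(x))
)

# Flag every row that ties for the minimum within its group
all_min <- as.logical(
  ave(DF$B, DF$A, FUN = function(x) x == min(x))
)

one_per_group <- DF[first_min, ]  # one row per value of A
all_per_group <- DF[all_min, ]    # all tied rows are kept
```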

3) If you do want to use aggregate then do it like this:

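# one minimum per group: aggregate over row indices, keeping for each A
# the index of the row where B is smallest, then select those rows
sq <- seq_len(nrow(DF))
DF[aggregate(sq ~ A, DF, function(ix) ix[which.min(DF$B[ix])])$sq, ]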
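The one-liner above can be unpacked into steps. This is only a sketch using the question's example data; the column name idx and the variable winners are names introduced here for illustration.

```r
# Example data from the question
DF <- data.frame(
  A = c(1, 1, 1, 1, 2, 2, 2, 2),
  B = c(0, 2, 1, 1, 1, 2, 0, 3),
  C = c(1, 2, 3, 4, 1, 2, 3, 4)
)

# Step 1: give every row its own index (hypothetical helper column)
DF$idx <- seq_len(nrow(DF))

# Step 2: for each value of A, FUN receives that group's row indices
# and returns the index of the row where B is smallest
winners <- aggregate(idx ~ A, data = DF,
                     FUN = function(ix) ix[which.min(DF$B[ix])])

# Step 3: use the winning indices to pull the full rows
res <- DF[winners$idx, c("A", "B", "C")]
res$C
```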
G. Grothendieck answered Oct 20 '22 15:10