
What's the R way to do the following group by?

Tags:

r

group-by

I have some dataset like this:

date       value      class
1984-04-01 95.32384   A
1984-04-01 39.86818   B
1984-07-01 43.57983   A
1984-07-01 10.83754   B

Now I would like to group the data by date and subtract the value of class B from the value of class A. I looked into ddply, summarize, melt and aggregate but cannot quite get what I want. Is there an easy way to do it? Note that I have exactly two values per date: one of class A and one of class B. I could split it into two data frames, order them by date and class, and merge them again, but I feel there is a more R way to do it.

Matt Bannert asked Jun 16 '11 10:06




1 Answer

Assuming this data frame (generated as in Prasad's post but with a set.seed for reproducibility):

set.seed(123)
DF <- data.frame( date = rep(seq(as.Date('1984-04-01'), 
                                 as.Date('1984-04-01') + 3, by=1), 
                            1, each=2),
                  class = rep(c('A','B'), 4),
                  value = sample(1:8))

then we consider seven solutions:

1) zoo can give us a one-line solution (not counting the library statement):

library(zoo)
z <- with(read.zoo(DF, split = 2), A - B)

giving this zoo series:

> z
1984-04-01 1984-04-02 1984-04-03 1984-04-04 
        -3          3          3         -5 

Also note that as.data.frame(z) or data.frame(time = time(z), value = coredata(z)) gives a data frame; however, you may wish to leave it as a zoo object, since it is a time series and other operations, e.g. plot(z), are more conveniently done on it in this form.

2) sqldf can also give a one-statement solution (aside from the library invocation):

> library(sqldf)
> sqldf("select date, sum(((class = 'A') - (class = 'B')) * value) as value
+ from DF group by date")
        date value
1 1984-04-01    -3
2 1984-04-02     3
3 1984-04-03     3
4 1984-04-04    -5

3) tapply can be used as the basis of a solution inspired by the sqldf solution:

> with(DF, tapply(((class =="A") - (class == "B")) * value, date, sum))
1984-04-01 1984-04-02 1984-04-03 1984-04-04 
        -3          3          3         -5 
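
The sign trick shared by the sqldf and tapply solutions may be easier to follow when spelled out step by step. The following is a minimal base R sketch, not part of the original answer; it assumes the four rows from the question as toy data (so the numbers differ from the sampled DF above), and the names sgn and diffs are purely illustrative:

```r
# Toy data copied from the question: one A row and one B row per date.
DF <- data.frame(
  date  = as.Date(rep(c("1984-04-01", "1984-07-01"), each = 2)),
  class = rep(c("A", "B"), times = 2),
  value = c(95.32384, 39.86818, 43.57983, 10.83754)
)

# Logical comparisons coerce to 0/1 in arithmetic, so this difference
# is +1 on class A rows and -1 on class B rows.
sgn <- (DF$class == "A") - (DF$class == "B")
print(sgn)

# Summing the signed values within each date therefore gives A - B.
diffs <- tapply(sgn * DF$value, DF$date, sum)
print(diffs)
```

Because each date has exactly one A and one B row, every per-date sum reduces to value(A) - value(B). A date with a missing or duplicated class would silently produce a different quantity, so this approach relies on the guarantee stated in the question.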

4) aggregate can be used in the same way as sqldf and tapply above (although a slightly different solution also based on aggregate has already appeared):

> aggregate(((DF$class=="A") - (DF$class=="B")) * DF["value"], DF["date"], sum)
        date value
1 1984-04-01    -3
2 1984-04-02     3
3 1984-04-03     3
4 1984-04-04    -5
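
aggregate also has a formula interface, which pairs naturally with transform. This is a hedged sketch rather than part of the original answer; it assumes the question's four rows as toy data, and the name res is illustrative only:

```r
# Toy data copied from the question: one A row and one B row per date.
DF <- data.frame(
  date  = as.Date(rep(c("1984-04-01", "1984-07-01"), each = 2)),
  class = rep(c("A", "B"), times = 2),
  value = c(95.32384, 39.86818, 43.57983, 10.83754)
)

# Flip the sign of the class B values first, then sum within each date.
signed <- transform(DF, value = ((class == "A") - (class == "B")) * value)
res <- aggregate(value ~ date, data = signed, FUN = sum)
print(res)
```

The result is an ordinary data frame with one row per date, which may be convenient if further data frame operations follow.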

5) summaryBy from the doBy package can provide yet another solution although it does need a transform to help it along:

> library(doBy)
> summaryBy(value ~ date, transform(DF, value = ((class == "A") - (class == "B")) * value), FUN = sum, keep.names = TRUE)
        date value
1 1984-04-01    -3
2 1984-04-02     3
3 1984-04-03     3
4 1984-04-04    -5

6) remix from the remix package can do it too, again with the help of a transform, and features particularly pretty output:

> library(remix)
> remix(value ~ date, transform(DF, value = ((class == "A") - (class == "B")) * value), sum)
value ~ date
============

+------+------------+-------+-----+
|                           | sum |
+======+============+=======+=====+
| date | 1984-04-01 | value | -3  |
+      +------------+-------+-----+
|      | 1984-04-02 | value | 3   |
+      +------------+-------+-----+
|      | 1984-04-03 | value | 3   |
+      +------------+-------+-----+
|      | 1984-04-04 | value | -5  |
+------+------------+-------+-----+

7) summary.formula in the Hmisc package also has pretty output:

> library(Hmisc)
> summary(value ~ date, data = transform(DF, value = ((class == "A") - (class == "B")) * value), fun = sum, overall = FALSE)
value    N=8

+----+----------+-+-----+
|    |          |N|value|
+----+----------+-+-----+
|date|1984-04-01|2|-3   |
|    |1984-04-02|2| 3   |
|    |1984-04-03|2| 3   |
|    |1984-04-04|2|-5   |
+----+----------+-+-----+
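
For comparison, the split-and-merge route mentioned in the question can also be written compactly in base R. This is only a sketch under the question's assumption of exactly one A and one B row per date; it uses the question's four rows as toy data, and the names A, B and wide are illustrative:

```r
# Toy data copied from the question: one A row and one B row per date.
DF <- data.frame(
  date  = as.Date(rep(c("1984-04-01", "1984-07-01"), each = 2)),
  class = rep(c("A", "B"), times = 2),
  value = c(95.32384, 39.86818, 43.57983, 10.83754)
)

# Split by class, keeping only the columns needed for the join.
A <- DF[DF$class == "A", c("date", "value")]
B <- DF[DF$class == "B", c("date", "value")]

# Join on date; the suffixes distinguish the two value columns.
wide <- merge(A, B, by = "date", suffixes = c(".A", ".B"))
wide$diff <- wide$value.A - wide$value.B
print(wide)
```

Unlike the sign-and-sum solutions, merge by default drops any date that lacks one of the two classes, which makes incomplete dates visible through a shorter result rather than a misleading number.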
G. Grothendieck answered Sep 27 '22 23:09