
Compute the difference between groups with dplyr in R

I am using dplyr and am wondering whether it is possible to compute differences between groups in one line. In the small example below, the task is to compute the difference between the standardized "cent" variables of groups A and B.

library(dplyr)
# creating a small data.frame
GROUP <- rep(c("A","B"),each=10)
NUMBE <- rnorm(20,50,10)
datf <- data.frame(GROUP,NUMBE)

# standardize NUMBE within each group
datf2 <- datf %>% group_by(GROUP) %>% mutate(cent = (NUMBE - mean(NUMBE))/sd(NUMBE))

# extract the standardized values of each group
gA <- datf2 %>% ungroup() %>% filter(GROUP == "A") %>% select(cent)
gB <- datf2 %>% ungroup() %>% filter(GROUP == "B") %>% select(cent)

gA - gB

This is of course no problem when creating separate objects, but is there a more "built-in" way of performing this task? Something like the non-working fantasy code below?

datf2 %>% summarize(filter(GROUP == "A",select(cent)) - filter(GROUP == "B",select(cent)))

Thank you!

Manuel asked Mar 23 '14 10:03



2 Answers

Given that each group has 10 rows, add an index (1:10 for group A and 1:10 for group B) and summarize over that index, taking the difference:

> datf2$entry=c(1:10,1:10)
> datf2 %>% ungroup() %>% group_by(entry) %>% summarize(d=cent[1]-cent[2])
Source: local data frame [10 x 2]

   entry          d
1      1 -0.8272879
2      2 -0.9159827
3      3 -0.5064762
4      4  0.4211639
5      5  1.3681720
6      6  3.3430289
7      7  1.0086822
8      8 -0.6163907
9      9 -0.7325220
10    10 -2.5423875

Compare with the original result:

> gA - gB
         cent
1  -0.8272879
2  -0.9159827
3  -0.5064762
4   0.4211639
5   1.3681720
6   3.3430289
7   1.0086822
8  -0.6163907
9  -0.7325220
10 -2.5423875

Is there a way to inject the entry field into the data or the dplyr call? I'm not sure, it seems to rely on the functions knowing too much about the data...
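One possible way to create the entry field inside the dplyr call itself is to number the rows within each group using row_number(). The following is a minimal sketch, not a definitive solution: it assumes both groups have the same number of rows and that rows are paired by their position within the group. The set.seed() call and the name diffs are additions for illustration only.

```r
library(dplyr)

set.seed(1)  # assumption: fixed seed, only to make the sketch reproducible
datf <- data.frame(GROUP = rep(c("A", "B"), each = 10),
                   NUMBE = rnorm(20, 50, 10))

diffs <- datf %>%
  group_by(GROUP) %>%
  mutate(cent  = (NUMBE - mean(NUMBE)) / sd(NUMBE),  # standardize within group
         entry = row_number()) %>%                   # position within group: 1..10
  ungroup() %>%
  group_by(entry) %>%
  # each entry holds exactly one A row and one B row
  summarise(d = cent[GROUP == "A"] - cent[GROUP == "B"])

diffs
```

Selecting by GROUP == "A" rather than by position (cent[1]) means the sign of the difference does not depend on the row order of the data.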

Spacedman answered Sep 19 '22 21:09


Thank you for the inspiration. I developed this solution further into the following:

mutate(datf2,difference = filter(datf2, GROUP == "A")$cent - filter(datf2, GROUP == "B")$cent)

This adds the result as a column in the data.frame.
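If only the vector of differences is needed, rather than an extra column, the whole computation can also be written as a single pipeline. The sketch below assumes dplyr 1.1.0 or later, where reframe() is available for summaries that return more than one row; it also assumes equally sized groups paired by row order. The set.seed() call and the name result are illustrative additions.

```r
library(dplyr)

set.seed(1)  # assumption: fixed seed, only to make the sketch reproducible
datf <- data.frame(GROUP = rep(c("A", "B"), each = 10),
                   NUMBE = rnorm(20, 50, 10))

result <- datf %>%
  group_by(GROUP) %>%
  mutate(cent = (NUMBE - mean(NUMBE)) / sd(NUMBE)) %>%  # standardize within group
  ungroup() %>%
  # reframe() may return any number of rows; here one row per A/B pair
  reframe(difference = cent[GROUP == "A"] - cent[GROUP == "B"])

result
```

Unlike the mutate() version, this returns a 10-row table containing only the differences, so no values are repeated across groups.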

Manuel answered Sep 21 '22 21:09