
How can I keep columns when grouping/summarizing?

Tags:

r

dplyr

The problem with this question is that I cannot post the actual code because of an agreement I had to sign. I'm also new to R and probably can't explain it that well, but maybe someone can help me anyway...

Let's say I have some data:

A   B    C   D
F1  6.6  10  10
F1  3.1  10  10
A1  1.0  20  10
B1  3.4  20  20

So, for every A, the C and D values are the same. But I want to use dplyr to find Bmean like so:

A    Bmean   C    D
F1   4.85    10  10
A1   1.0     20  10
B1   3.4     20  20

How would I do that? My idea was to use something like

dplyr::group_by(A) %>% dplyr::summarize(Bmean = mean(B))

but C and D seem to disappear after this operation. Would it make sense to group_by all columns I want to keep? Or how would that work?

Just to clarify, I would like to use the dplyr syntax, since it's part of a bigger operation, if possible.

Silverclaw asked Dec 19 '22 14:12



2 Answers

You can do this using base R:

aggregate(B ~ ., data = df1, FUN = mean)
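As a minimal sketch of how that call behaves on the question's sample data, assuming the data is stored in a data frame named df1 as in the call above: the formula B ~ . averages B within every combination of the remaining columns, so C and D are kept. The result variable res and the rename step are my own additions for illustration, since aggregate() keeps the column name B rather than producing Bmean.

```r
# Sample data from the question; the name df1 follows the aggregate() call above
df1 <- data.frame(
  A = c("F1", "F1", "A1", "B1"),
  B = c(6.6, 3.1, 1.0, 3.4),
  C = c(10, 10, 20, 20),
  D = c(10, 10, 10, 20)
)

# Average B within every combination of the remaining columns (A, C and D)
res <- aggregate(B ~ ., data = df1, FUN = mean)

# aggregate() keeps the name B, so rename it to match the desired output
names(res)[names(res) == "B"] <- "Bmean"
res
```

Note that this groups by all other columns at once, so it only gives one row per A when C and D are constant within each A.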
user2100721 answered Jan 05 '23 13:01



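I would like to add an answer that specifically solves the problem with dplyr. While I'm sure there are more elegant ways of doing this, the following proposal retains columns holding additional descriptive variables in a summarized/aggregated data frame. It relies on each of those columns having a single value per group. If that is not the case, dplyr versions before 1.0.0 stop with an error, which protects you from mistakes in bigger data frames. From 1.0.0 on, summarise() instead returns several rows for that group (with a deprecation warning since 1.1.0), so check the row count of the result.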

library(dplyr)
library(tibble)

df <- tribble(
  ~A  , ~B , ~C , ~D ,
  "F1", 6.6, 10 , 10 ,
  "F1", 3.1, 10 , 10 ,
  "A1", 1.0, 20 , 10 ,
  "B1", 3.4, 20 , 20
)

The following code drops the columns C and D:

df %>%
  group_by(A) %>%
  summarise(Bmean = mean(B)) 

This code keeps the columns C and D. Note that this only works if C and D each hold the same value in every row of a group. But since these variables are only meant to be retained and should not influence the grouping behaviour, this should be the case anyway.

df %>%
  group_by(A) %>%
  summarise(Bmean = mean(B),
            C = unique(C),
            D = unique(D))
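If you would rather make the "one value per group" assumption explicit, a small sketch could count the distinct values first and stop when a group has more than one. The names check and res are hypothetical, and using n_distinct() with first() is my own suggestion, not part of the original approach; the data frame is the one defined above, with the third column named C.

```r
library(dplyr)
library(tibble)

# Sample data from the question
df <- tribble(
  ~A  , ~B , ~C , ~D ,
  "F1", 6.6, 10 , 10 ,
  "F1", 3.1, 10 , 10 ,
  "A1", 1.0, 20 , 10 ,
  "B1", 3.4, 20 , 20
)

# Count how many distinct C and D values each group of A contains
check <- df %>%
  group_by(A) %>%
  summarise(n_C = n_distinct(C), n_D = n_distinct(D))

# Stop early if any group holds more than one value in a column we want to keep
stopifnot(all(check$n_C == 1), all(check$n_D == 1))

# After the check, first() simply picks the single value of each group
res <- df %>%
  group_by(A) %>%
  summarise(Bmean = mean(B), C = first(C), D = first(D))
res
```

The explicit check fails the same way regardless of the installed dplyr version, which unique() inside summarise() does not.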

Update:

In fact, you can also include the columns you want to keep in the group_by expression, provided they are constant within each group of A, i.e. they do not split the groups any further:

  A ,  B ,  C ,  D
Group1:
"F1", 6.6, 10 , 10
"F1", 3.1, 10 , 10
Group2:
"A1", 1.0, 20 , 10
Group3:
"B1", 3.4, 20 , 20

Note that columns C and D maintain the same value within each group. This means they could safely be used in the grouping expression and thus be retained.

So in your case, this would also work as the grouping step:

group_by(A, C, D)
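Put together as a full pipeline, this might look like the sketch below. It assumes dplyr 1.0.0 or later, because the .groups argument of summarise() does not exist in older versions. The name res and the final select() that restores the column order from the question are my own additions.

```r
library(dplyr)
library(tibble)

# Sample data from the question
df <- tribble(
  ~A  , ~B , ~C , ~D ,
  "F1", 6.6, 10 , 10 ,
  "F1", 3.1, 10 , 10 ,
  "A1", 1.0, 20 , 10 ,
  "B1", 3.4, 20 , 20
)

res <- df %>%
  # C and D are constant within each A, so they do not create extra groups
  group_by(A, C, D) %>%
  # .groups = "drop" returns an ungrouped tibble (assumes dplyr >= 1.0.0)
  summarise(Bmean = mean(B), .groups = "drop") %>%
  # Reorder the columns to match the layout asked for in the question
  select(A, Bmean, C, D)
res
```

If C or D did vary within an A, this version would not fail. It would silently return one row per distinct combination, so compare the number of result rows with the number of distinct A values when that matters.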
Florian answered Jan 05 '23 14:01
