
Group by columns and summarize a column into a list

Tags: r, group-by, dplyr

I have a dataframe like this:

sample_df<-data.frame(
   client=c('John', 'John','Mary','Mary'),
   date=c('2016-07-13','2016-07-13','2016-07-13','2016-07-13'),
   cluster=c('A','B','A','A'))

#sample data frame
   client date         cluster
1  John   2016-07-13    A 
2  John   2016-07-13    B 
3  Mary   2016-07-13    A 
4  Mary   2016-07-13    A             

I would like to transform it into a different format, like this:

#ideal data frame
   client date         cluster
1  John   2016-07-13    c('A','B') 
2  Mary   2016-07-13    A 

The 'cluster' column should hold a list when a client belongs to more than one cluster on the same date.

I thought I could do this with the dplyr package, using a command like the one below:

library(dplyr)
ideal_df <- sample_df %>% 
    group_by(client, date) %>% 
    summarize(
        # some anonymous function
    )

However, I don't know how to write the anonymous function in this situation. Is there a way to transform the data into the ideal format?

Johnny Chiu asked Jul 13 '16 09:07




1 Answer

We can use toString() to concatenate the unique elements of 'cluster' into a single comma-separated string after grouping by 'client' and 'date':

r1 <- sample_df %>% 
         group_by(client, date) %>%
         summarise(cluster = toString(unique(cluster)))
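
For reference, here is a self-contained sketch of the same approach that can be run as-is. It assumes dplyr 1.0.0 or later, where summarise() accepts a `.groups` argument; `.groups = "drop"` and `stringsAsFactors = FALSE` are additions for illustration and are not part of the original answer.

```r
library(dplyr)

# Sample data from the question; stringsAsFactors = FALSE keeps plain
# character columns on older R versions as well
sample_df <- data.frame(
  client  = c("John", "John", "Mary", "Mary"),
  date    = c("2016-07-13", "2016-07-13", "2016-07-13", "2016-07-13"),
  cluster = c("A", "B", "A", "A"),
  stringsAsFactors = FALSE
)

# One row per client/date; the distinct clusters are joined into one string
r1 <- sample_df %>%
  group_by(client, date) %>%
  summarise(cluster = toString(unique(cluster)), .groups = "drop")

r1$cluster
# [1] "A, B" "A"
```

Note that the result is an ordinary character column, so "A, B" is a single string rather than two separate values.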

Another option is to create a list column, which keeps each group's clusters as a separate vector:

r2 <- sample_df %>%
         group_by(client, date) %>% 
         summarise(cluster = list(unique(cluster)))
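
As a rough illustration of what the list column holds, the sketch below rebuilds the sample data and inspects the individual elements. It assumes dplyr 1.0.0 or later for the `.groups` argument, which is an addition here and not part of the original answer.

```r
library(dplyr)

# Sample data from the question
sample_df <- data.frame(
  client  = c("John", "John", "Mary", "Mary"),
  date    = c("2016-07-13", "2016-07-13", "2016-07-13", "2016-07-13"),
  cluster = c("A", "B", "A", "A"),
  stringsAsFactors = FALSE
)

# Each cell of 'cluster' is now a character vector of that group's clusters
r2 <- sample_df %>%
  group_by(client, date) %>%
  summarise(cluster = list(unique(cluster)), .groups = "drop")

r2$cluster[[1]]      # John's clusters
# [1] "A" "B"
lengths(r2$cluster)  # number of distinct clusters per row
# [1] 2 1
```

This is the closest match to the c('A','B') cell shown in the question, since each element stays a real vector that can be indexed or iterated over.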

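The list column can later be expanded back to one row per cluster with unnest() from tidyr (recent tidyr versions expect the column to be named via `cols`):

library(tidyr)
r2 %>%
    ungroup() %>%
    unnest(cols = cluster)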
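
To see what unnesting gives back, here is a minimal end-to-end sketch. It assumes tidyr 1.0.0 or later (for the `cols` argument) and dplyr 1.0.0 or later (for `.groups`); the variable name `long` is only illustrative.

```r
library(dplyr)
library(tidyr)

# Sample data from the question
sample_df <- data.frame(
  client  = c("John", "John", "Mary", "Mary"),
  date    = c("2016-07-13", "2016-07-13", "2016-07-13", "2016-07-13"),
  cluster = c("A", "B", "A", "A"),
  stringsAsFactors = FALSE
)

# Collapse to a list column, then expand it again
long <- sample_df %>%
  group_by(client, date) %>%
  summarise(cluster = list(unique(cluster)), .groups = "drop") %>%
  unnest(cols = cluster)

nrow(long)
# [1] 3
long$cluster
# [1] "A" "B" "A"
```

The round trip returns three rows rather than the original four, because unique() dropped Mary's duplicated 'A' before the list was built.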
akrun answered Oct 05 '22 15:10
