Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Applying group_by and summarise on data while keeping all the columns' info

Tags:

r

dplyr

I have a large dataset with 22000 rows and 25 columns. I am trying to group my dataset based on one of the columns and take the min value of the other column based on the grouped dataset. However, the problem is that it only gives me two columns containing the grouped column and the column having the min value... but I need all the information of other columns related to the rows with the min values. Here is a simple example just to make it reproducible:

    data<- data.frame(a=1:10, b=c("a","a","a","b","b","c","c","d","d","d"), c=c(1.2, 2.2, 2.4, 1.7, 2.7, 3.1, 3.2, 4.2, 3.3, 2.2), d= c("small", "med", "larg", "larg", "larg", "med", "small", "small", "small", "med"))      d<- data %>%     group_by(b) %>%     summarise(min_values= min(c))     d     b min_values     1 a        1.2     2 b        1.7     3 c        3.1     4 d        2.2 

So, I need to have also the information related to columns a and d, however, since I have duplications in the values in column c I cannot merge them based on the min_value column... I was wondering if there is any way to keep other columns' information when we are using dplyr package.

I have found some explanation here "dplyr: group_by, subset and summarise" and here "Finding percentage in a sub-group using group_by and summarise" but none of the addresses my problem.

like image 662
Momeneh Foroutan Avatar asked May 04 '15 07:05

Momeneh Foroutan


People also ask

What is the purpose of Group_by () function?

Group_by() function belongs to the dplyr package in the R programming language, which groups the data frames. Group_by() function alone will not give any output. It should be followed by summarise() function with an appropriate action to perform. It works similar to GROUP BY in SQL and pivot table in excel.

How do you summarize multiple columns?

For example, to sum one column of cells, you may use "=SUM(A2:A32)" or to sum two columns you may use "=SUM(A2:A32,B2:B32)." Press "Enter" to display your results in the selected cell. If you added two columns with the range argument, the formula displays the results of both ranges added together.

What does Summarise () do in R?

Summarize Function in R Programming. As its name implies, the summarize function reduces a data frame to a summary of just one vector or value. Many times, these summaries are calculated by grouping observations using a factor or categorical variables first.


1 Answers

Here are two options using a) filter and b) slice from dplyr. In this case there are no duplicated minimum values in column c for any of the groups and so the results of a) and b) are the same. If there were duplicated minima, approach a) would return each minima per group while b) would only return one minimum (the first) in each group.

a)

> data %>% group_by(b) %>% filter(c == min(c)) #Source: local data frame [4 x 4] #Groups: b # #   a b   c     d #1  1 a 1.2 small #2  4 b 1.7  larg #3  6 c 3.1   med #4 10 d 2.2   med 

Or similarly

> data %>% group_by(b) %>% filter(min_rank(c) == 1L) #Source: local data frame [4 x 4] #Groups: b # #   a b   c     d #1  1 a 1.2 small #2  4 b 1.7  larg #3  6 c 3.1   med #4 10 d 2.2   med 

b)

> data %>% group_by(b) %>% slice(which.min(c)) #Source: local data frame [4 x 4] #Groups: b # #   a b   c     d #1  1 a 1.2 small #2  4 b 1.7  larg #3  6 c 3.1   med #4 10 d 2.2   med 
like image 60
talat Avatar answered Sep 30 '22 18:09

talat