Applying group_by and summarise on data while keeping all the columns' info

Tags: r, dplyr

I have a large dataset with 22000 rows and 25 columns. I am trying to group it by one column and take the minimum of another column within each group. The problem is that the result contains only two columns, the grouping column and the minimum, but I need all the other columns for the rows that hold the minimum values. Here is a simple reproducible example:

    library(dplyr)

    data <- data.frame(
      a = 1:10,
      b = c("a", "a", "a", "b", "b", "c", "c", "d", "d", "d"),
      c = c(1.2, 2.2, 2.4, 1.7, 2.7, 3.1, 3.2, 4.2, 3.3, 2.2),
      d = c("small", "med", "larg", "larg", "larg", "med", "small", "small", "small", "med")
    )

    # Group by b and take the minimum of c within each group
    d <- data %>%
      group_by(b) %>%
      summarise(min_values = min(c))
    d
    #   b min_values
    # 1 a        1.2
    # 2 b        1.7
    # 3 c        3.1
    # 4 d        2.2

So I also need the information in columns a and d. However, because column c contains duplicated values, I cannot merge the result back onto the original data using the min_values column. Is there a way to keep the other columns' information when using the dplyr package?

I have found some explanations here, "dplyr: group_by, subset and summarise", and here, "Finding percentage in a sub-group using group_by and summarise", but neither of them addresses my problem.

asked May 04 '15 07:05

Momeneh Foroutan

1 Answer

Here are two options using a) filter and b) slice from dplyr. In this case no group has a duplicated minimum value in column c, so a) and b) give the same result. If there were duplicated minima, approach a) would return every row holding the minimum in each group, while b) would return only one such row (the first) per group.

a)

    data %>%
      group_by(b) %>%
      filter(c == min(c))
    #Source: local data frame [4 x 4]
    #Groups: b
    #
    #   a b   c     d
    #1  1 a 1.2 small
    #2  4 b 1.7  larg
    #3  6 c 3.1   med
    #4 10 d 2.2   med

Or similarly

    data %>%
      group_by(b) %>%
      filter(min_rank(c) == 1L)
    #Source: local data frame [4 x 4]
    #Groups: b
    #
    #   a b   c     d
    #1  1 a 1.2 small
    #2  4 b 1.7  larg
    #3  6 c 3.1   med
    #4 10 d 2.2   med

b)

    data %>%
      group_by(b) %>%
      slice(which.min(c))
    #Source: local data frame [4 x 4]
    #Groups: b
    #
    #   a b   c     d
    #1  1 a 1.2 small
    #2  4 b 1.7  larg
    #3  6 c 3.1   med
    #4 10 d 2.2   med
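The example data has no tied minima, so the difference between the two approaches does not show up in the output above. The sketch below uses a small made-up data frame (`tied`, with hypothetical columns `id`, `grp` and `val`, none of which come from the question) in which one group has a tied minimum. The last part assumes dplyr 1.0.0 or later, where `slice_min()` is available and appears to cover both behaviours through its `with_ties` argument.

```r
library(dplyr)

# Hypothetical data, not from the question: group "x" has a tied minimum (1.5 twice).
tied <- data.frame(
  id  = 1:5,
  grp = c("x", "x", "x", "y", "y"),
  val = c(1.5, 1.5, 2.0, 3.0, 2.5)
)

# a) filter() keeps every row whose value equals its group's minimum.
all_minima <- tied %>%
  group_by(grp) %>%
  filter(val == min(val)) %>%
  ungroup()

# b) slice(which.min()) keeps only the first minimum row in each group.
first_minimum <- tied %>%
  group_by(grp) %>%
  slice(which.min(val)) %>%
  ungroup()

# Assumption: dplyr >= 1.0.0. with_ties = FALSE keeps one row per group,
# while the default with_ties = TRUE would keep all tied rows.
one_per_group <- tied %>%
  group_by(grp) %>%
  slice_min(val, n = 1, with_ties = FALSE) %>%
  ungroup()

all_minima$id     # ids 1, 2 and 5: both tied rows of "x" are kept
first_minimum$id  # ids 1 and 5: only the first tied row of "x" is kept
one_per_group$id  # ids 1 and 5
```

Which variant fits depends on whether downstream code expects exactly one row per group; if it does, b) or `slice_min(..., with_ties = FALSE)` is probably the safer choice.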

answered Sep 30 '22 18:09

talat
