
How to use dplyr's summarize and which() to look up min/max values

Tags: r, dplyr

I have the following data:

    Name <- c("Sam", "Sarah", "Jim", "Fred", "James", "Sally", "Andrew",
              "John", "Mairin", "Kate", "Sasha", "Ray", "Ed")
    Age <- c(22, 12, 31, 35, 58, 82, 17, 34, 12, 24, 44, 67, 43)
    Group <- c("A", "B", "B", "B", "B", "C", "C", "D", "D", "D", "D", "D", "D")

    data <- data.frame(Name, Age, Group)

And I'd like to use dplyr to:

(1) group the data by "Group"
(2) show the min and max Age within each Group
(3) show the Name of the person with the min and max ages

The following code does this:

    data %>% group_by(Group) %>%
      summarize(minAge = min(Age), minAgeName = Name[which(Age == min(Age))],
                maxAge = max(Age), maxAgeName = Name[which(Age == max(Age))])

Which works well:

      Group minAge minAgeName maxAge maxAgeName
    1     A     22        Sam     22        Sam
    2     B     12      Sarah     58      James
    3     C     17     Andrew     82      Sally
    4     D     12     Mairin     67        Ray

However, I have a problem if there are multiple min or max values:

    Name <- c("Sam", "Sarah", "Jim", "Fred", "James", "Sally", "Andrew",
              "John", "Mairin", "Kate", "Sasha", "Ray", "Ed")
    Age <- c(22, 31, 31, 35, 58, 82, 17, 34, 12, 24, 44, 67, 43)
    Group <- c("A", "B", "B", "B", "B", "C", "C", "D", "D", "D", "D", "D", "D")

    data <- data.frame(Name, Age, Group)

    > data %>% group_by(Group) %>%
    +   summarize(minAge = min(Age), minAgeName = Name[which(Age == min(Age))],
    +             maxAge = max(Age), maxAgeName = Name[which(Age == max(Age))])
    Error: expecting a single value

I'm looking for two solutions:

(1) where it doesn't matter which min or max name is shown, just that one is shown (i.e., the first value found)
(2) where if there are "ties" all minimum values and maximum values are shown

Please let me know if this isn't clear and thanks in advance!

dreww2 asked May 12 '15 16:05




2 Answers

You can use which.min and which.max to get the first value.

    data %>% group_by(Group) %>%
      summarize(minAge = min(Age), minAgeName = Name[which.min(Age)],
                maxAge = max(Age), maxAgeName = Name[which.max(Age)])
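To see why this avoids the error, it may help to look at group B on its own. The sketch below uses plain vectors (`age_b` and `name_b` are illustrative names, not from the original post) and assumes nothing beyond base R: `which()` returns every matching position, while `which.min()` and `which.max()` return only the first.

```r
# Group B from the second data set, where two people share the minimum age
age_b  <- c(31, 31, 35, 58)
name_b <- c("Sarah", "Jim", "Fred", "James")

# which() returns every matching position, so a tie gives two names
all_min <- name_b[which(age_b == min(age_b))]

# which.min() and which.max() return only the first matching position
first_min <- name_b[which.min(age_b)]
first_max <- name_b[which.max(age_b)]

print(all_min)    # "Sarah" "Jim"
print(first_min)  # "Sarah"
print(first_max)  # "James"
```

The two-element result from `which()` is what appears to trigger "expecting a single value" in the question.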

To get all values, use e.g. paste with an appropriate collapse argument.

    data %>% group_by(Group) %>%
      summarize(minAgeName = paste(Name[which(Age == min(Age))], collapse = ", "),
                minAge = min(Age),
                maxAgeName = paste(Name[which(Age == max(Age))], collapse = ", "),
                maxAge = max(Age))
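As a small illustration of what `collapse` does here (the variable names `tied_names`, `joined`, and `single` are made up for this sketch): it folds a character vector of any length into one string, so each group yields a single value whether or not there is a tie.

```r
# Two names tied for the minimum age in group B
tied_names <- c("Sarah", "Jim")

# collapse joins all elements into one string of length 1
joined <- paste(tied_names, collapse = ", ")
print(joined)          # "Sarah, Jim"
print(length(joined))  # 1

# With no tie, the single name passes through unchanged
single <- paste("James", collapse = ", ")
print(single)          # "James"
```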
shadow answered Oct 02 '22 05:10



I would actually recommend keeping your data in a "long" format. Here's how I would approach this:

    library(dplyr)

Keeping all values when there are ties:

    data %>%
      group_by(Group) %>%
      arrange(Age) %>%  ## optional
      filter(Age %in% range(Age))
    # Source: local data frame [8 x 3]
    # Groups: Group
    #
    #     Name Age Group
    # 1    Sam  22     A
    # 2  Sarah  31     B
    # 3    Jim  31     B
    # 4  James  58     B
    # 5 Andrew  17     C
    # 6  Sally  82     C
    # 7 Mairin  12     D
    # 8    Ray  67     D
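The filter condition can be checked on a single group with base R alone. In this sketch (again using the illustrative vectors `age_b` and `name_b` for group B), `range()` returns the minimum and maximum, and `%in%` marks every row equal to either one, which is why both tied rows survive.

```r
# Group B from the second data set
age_b  <- c(31, 31, 35, 58)
name_b <- c("Sarah", "Jim", "Fred", "James")

# range() returns c(min, max)
age_range <- range(age_b)
print(age_range)     # 31 58

# %in% flags every element that equals the min or the max, ties included
keep <- age_b %in% age_range
print(keep)          # TRUE TRUE FALSE TRUE
print(name_b[keep])  # "Sarah" "Jim" "James"
```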

Keeping only one value when there are ties:

    data %>%
      group_by(Group) %>%
      arrange(Age) %>%
      slice(if (length(Age) == 1) 1 else c(1, n())) ## maybe overkill?
    # Source: local data frame [7 x 3]
    # Groups: Group
    #
    #     Name Age Group
    # 1    Sam  22     A
    # 2  Sarah  31     B
    # 3  James  58     B
    # 4 Andrew  17     C
    # 5  Sally  82     C
    # 6 Mairin  12     D
    # 7    Ray  67     D
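The index logic inside `slice()` can be sketched in base R as follows. The helper `first_last()` is hypothetical, written only for this illustration: it sorts positions by value and keeps the first and last, or just the one position when the group has a single row, so a one-person group is not listed twice.

```r
# Hypothetical helper: positions of the smallest and largest value,
# or a single position when there is only one element
first_last <- function(x) {
  ord <- order(x)  # positions sorted by value; ties keep original order
  if (length(x) == 1) ord else ord[c(1, length(x))]
}

# Group B: the tie at 31 resolves to the first tied row (Sarah)
age_b  <- c(31, 31, 35, 58)
name_b <- c("Sarah", "Jim", "Fred", "James")
idx_b <- first_last(age_b)
print(name_b[idx_b])    # "Sarah" "James"

# Group A has one row, so only one position is returned
idx_a <- first_last(22)
print(idx_a)            # 1
```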

If you really want a "wide" dataset, the basic concept would be to gather and spread the data, using "tidyr":

    library(dplyr)
    library(tidyr)

    data %>%
      group_by(Group) %>%
      arrange(Age) %>%
      slice(c(1, n())) %>%
      mutate(minmax = c("min", "max")) %>%
      gather(var, val, Name:Age) %>%
      unite(key, minmax, var) %>%
      spread(key, val)
    # Source: local data frame [4 x 5]
    #
    #   Group max_Age max_Name min_Age min_Name
    # 1     A      22      Sam      22      Sam
    # 2     B      58    James      31    Sarah
    # 3     C      82    Sally      17   Andrew
    # 4     D      67      Ray      12   Mairin

Though what wide form you would want with ties is unclear.

A5C1D2H2I1M1N2O1R2T1 answered Oct 02 '22 07:10
