R - dplyr Summarize and Retain Other Columns

Tags: r, dplyr, summarize

I am grouping data and then summarizing it, but I would also like to retain another column. That column needs no evaluation, because its value is fully determined by the group_by column. I could add it to the group_by statement, but that does not seem "right". I want to retain State.Full.Name after grouping by State. Thanks

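    library(dplyr)

    # 1000 random state abbreviations, plus the matching full state name
    TDAAtest <- data.frame(State = sample(state.abb, 1000, replace = TRUE))
    TDAAtest$State.Full.Name <- state.name[match(TDAAtest$State, state.abb)]

    # Count rows per State; State.Full.Name is dropped by summarize()
    TDAA.states <- TDAAtest %>%
      filter(!is.na(State)) %>%
      group_by(State) %>%
      summarize(n = n()) %>%
      ungroup() %>%
      arrange(State)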
atclaus asked Aug 23 '16 at 03:08


People also ask

What does %>% do in dplyr?

%>% is the forward pipe operator. It forwards a value, or the result of an expression, into the next function call as its first argument, which lets you chain commands from left to right. It is defined by the magrittr package (CRAN) and is heavily used by dplyr (CRAN).
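
As a small illustration of that forwarding, the sketch below computes the same value with nested calls and with a pipe. The vector x is made up for the example, and the pipe is loaded through dplyr, which re-exports it from magrittr.

```r
library(dplyr)  # re-exports %>% from magrittr

x <- c(1, 4, 9)

# Nested calls are read inside-out
nested <- sum(sqrt(x))

# Piped calls are read left to right: x becomes the first argument
# of sqrt(), and that result becomes the first argument of sum()
piped <- x %>% sqrt() %>% sum()

identical(nested, piped)
```

Both forms evaluate the same expression; the pipe only changes how it is written.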

What does summarise() do in R?

As its name implies, summarise() (also spelled summarize()) reduces a data frame to a summary, typically one row holding one value per summary expression. Often the observations are first grouped by a factor or other categorical variable, in which case the result has one row per group.

What does summarise(n = n()) do in R?

n = n() creates a column named n that holds the number of rows (think number of observations) in the current group, or in the whole data frame if it is not grouped.
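
To make the grouped and ungrouped cases concrete, here is a minimal sketch on a made-up six-row data frame (toy and its values are invented for illustration):

```r
library(dplyr)

# Hypothetical data: six observations across three states
toy <- data.frame(State = c("NY", "NY", "CA", "NY", "CA", "TX"))

# Ungrouped: the whole data frame collapses to a single row
total <- toy %>% summarise(n = n())

# Grouped: one row per State, with n counting the rows in each group
counts <- toy %>%
  group_by(State) %>%
  summarise(n = n())

total
counts
```

Note that summarise() keeps only the grouping columns and the columns it is asked to create, which is why any other column has to be requested explicitly.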


2 Answers

Perhaps we need first() inside summarise, since State.Full.Name is constant within each State:

    TDAAtest %>%
      filter(!is.na(State)) %>%
      group_by(State) %>%
      summarise(State.Full.Name = first(State.Full.Name), n = n())

Or use mutate to create the count column and then keep one row per State with distinct:

    TDAAtest %>%
      filter(!is.na(State)) %>%
      group_by(State) %>%
      mutate(n = n()) %>%
      distinct(State, .keep_all = TRUE)
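
As a quick check that the two approaches agree, the sketch below rebuilds the question's data and compares the results. The set.seed() call, the variable names via_summarise and via_distinct, and the trailing ungroup() and arrange() on the second pipeline are additions made here so the two outputs can be compared row for row.

```r
library(dplyr)

set.seed(1)  # assumption: fixed seed, only to make the sample reproducible
TDAAtest <- data.frame(State = sample(state.abb, 1000, replace = TRUE))
TDAAtest$State.Full.Name <- state.name[match(TDAAtest$State, state.abb)]

# Approach 1: carry the constant column through summarise() with first()
via_summarise <- TDAAtest %>%
  filter(!is.na(State)) %>%
  group_by(State) %>%
  summarise(State.Full.Name = first(State.Full.Name), n = n())

# Approach 2: add the count with mutate(), then keep one row per State
via_distinct <- TDAAtest %>%
  filter(!is.na(State)) %>%
  group_by(State) %>%
  mutate(n = n()) %>%
  distinct(State, .keep_all = TRUE) %>%
  ungroup() %>%
  arrange(State)

all.equal(as.data.frame(via_summarise), as.data.frame(via_distinct))
```

The summarise() version returns rows already ordered by State, while the distinct() version keeps rows in order of first appearance, hence the extra arrange().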
akrun answered Sep 21 '22 at 13:09


I believe there are more broadly applicable answers than the accepted one, especially when the other columns are not constant within each group (e.g. when you want the max, the min, or the top n rows based on one particular column).

The accepted answer works for this question, but suppose instead that you would like to find the county with the max population for each state. (This requires county and population columns; the code below assumes the latter is named Population.)

We have the following options:

1. dplyr version

Following the linked answer, this takes three extra operations (mutate, ungroup and filter):

    TDAAtest %>%
      filter(!is.na(State)) %>%
      group_by(State) %>%
      mutate(maxPopulation = max(Population)) %>%
      ungroup() %>%
      filter(maxPopulation == Population)
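
Since the question's TDAAtest has no Population column, here is a minimal sketch of the same mutate/ungroup/filter pattern on a small made-up table. The counties data frame, its County labels, and its population figures are all invented for illustration.

```r
library(dplyr)

# Hypothetical data: county labels and populations are made up
counties <- data.frame(
  State      = c("NY", "NY", "CA", "CA", "CA"),
  County     = c("A", "B", "C", "D", "E"),
  Population = c(100, 250, 300, 300, 50)
)

top_county <- counties %>%
  group_by(State) %>%
  mutate(maxPopulation = max(Population)) %>%  # per-state maximum on every row
  ungroup() %>%
  filter(maxPopulation == Population)          # keep rows that reach the maximum

top_county
```

Because the filter compares against the maximum, tied rows are all kept: the made-up CA counties C and D both survive here.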

2. Function version

This one gives you as much flexibility as you want and you can apply any kind of operation to each group:

    maxFUN <- function(x) {
      # order population in a descending order
      x <- x[with(x, order(-Population)), ]
      # return the first row, i.e. the one with the largest population
      x[1, ]
    }

    TDAAtest %>%
      filter(!is.na(State)) %>%
      group_by(State) %>%
      do(maxFUN(.))

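This one is highly recommended for more complex operations. For instance, you can return the top n (topN) counties per state by returning x[1:topN, ] from maxFUN (note the trailing comma, which selects rows rather than columns). Use head(x, topN) instead if some states may have fewer than topN counties, since x[1:topN, ] would pad the result with NA rows.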
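
The top-n variant can be sketched as follows. The helper name topFUN, its topN argument, and the counties data are hypothetical and made up for the example; head() is used so that groups with fewer than topN rows are returned as they are.

```r
library(dplyr)

# Hypothetical data: county labels and populations are made up
counties <- data.frame(
  State      = c("NY", "NY", "CA", "CA", "CA"),
  County     = c("A", "B", "C", "D", "E"),
  Population = c(100, 250, 300, 300, 50)
)

# Hypothetical helper: the topN most populous rows of one group
topFUN <- function(x, topN = 2) {
  x <- x[order(-x$Population), ]  # sort by population, descending
  head(x, topN)                   # at most topN rows, no NA padding
}

top2 <- counties %>%
  group_by(State) %>%
  do(topFUN(., topN = 2)) %>%
  ungroup()

top2
```

In dplyr 1.0.0 and later, slice_max(Population, n = 2) on the grouped data is a shorter route to a similar result, though with its default with_ties = TRUE it may return more than n rows per group when populations tie.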

Habib Karbasian answered Sep 21 '22 at 13:09
