I am grouping data and then summarizing it, but would also like to retain another column. I do not need to do any evaluations of that column's content as it will always be the same as the group_by column. I can add it to the group_by statement but that does not seem "right". I want to retain State.Full.Name after grouping by State. Thanks
library(dplyr)

TDAAtest <- data.frame(State = sample(state.abb, 1000, replace = TRUE))
TDAAtest$State.Full.Name <- state.name[match(TDAAtest$State, state.abb)]

TDAA.states <- TDAAtest %>%
  filter(!is.na(State)) %>%
  group_by(State) %>%
  summarize(n = n()) %>%
  ungroup() %>%
  arrange(State)
%>% is the forward pipe operator in R. It forwards a value, or the result of an expression, into the next function call or expression, which lets you chain commands from left to right. It is defined by the package magrittr (CRAN) and is heavily used by dplyr (CRAN).
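As a minimal sketch of what the pipe does, the following compares a nested call with the equivalent piped call. The vector x is a made-up example value, not something from the question.

```r
library(dplyr)  # dplyr re-exports %>% from magrittr

x <- c(1, 4, 9)  # made-up example values

# Nested call: read inside-out
nested <- sum(sqrt(x))

# Piped call: read left to right; each result becomes the first
# argument of the next function
piped <- x %>% sqrt() %>% sum()

identical(nested, piped)
```

Both forms compute the same value; the piped form simply lists the steps in the order they run.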
The summarize function in R: as its name implies, summarize() reduces a data frame to a summary of just one vector or value. Often these summaries are calculated after first grouping observations by a factor or categorical variable, which gives one summary row per group.
n = n() means that a column named n will hold the number of rows (think number of observations) in each group of the summarized data.
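To make the per-group counting concrete, here is a small sketch on a hypothetical data frame called toy (the name and its values are assumptions for illustration only).

```r
library(dplyr)

# Hypothetical toy data: two rows for CA, one for NY, three for TX
toy <- data.frame(State = c("CA", "CA", "NY", "TX", "TX", "TX"))

counts <- toy %>%
  group_by(State) %>%
  summarise(n = n())  # one row per State; n is that group's row count

counts
```

The result has one row per state, and n is the number of input rows that fell into each group.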
Perhaps we need
TDAAtest %>% filter(!is.na(State)) %>% group_by(State) %>% summarise(State.Full.Name = first(State.Full.Name), n = n())
Or use mutate to create the column and then call distinct:
TDAAtest %>%
  filter(!is.na(State)) %>%
  group_by(State) %>%
  mutate(n = n()) %>%
  distinct(State, .keep_all = TRUE)
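The two approaches should give the same states and counts. The sketch below rebuilds the question's data with a fixed seed (set.seed is an addition here, only so the run is repeatable) and compares the results. The names via_summarise and via_distinct are hypothetical.

```r
library(dplyr)

set.seed(1)  # added for repeatability; not part of the original question
TDAAtest <- data.frame(State = sample(state.abb, 1000, replace = TRUE))
TDAAtest$State.Full.Name <- state.name[match(TDAAtest$State, state.abb)]

# Approach 1: carry the name through summarise() with first()
via_summarise <- TDAAtest %>%
  filter(!is.na(State)) %>%
  group_by(State) %>%
  summarise(State.Full.Name = first(State.Full.Name), n = n())

# Approach 2: add the count with mutate(), then keep one row per State
via_distinct <- TDAAtest %>%
  filter(!is.na(State)) %>%
  group_by(State) %>%
  mutate(n = n()) %>%
  distinct(State, .keep_all = TRUE) %>%
  ungroup() %>%
  arrange(State)

# Compare the columns of the two results
identical(via_summarise$State, via_distinct$State)
identical(via_summarise$n, via_distinct$n)
```

The summarise version returns rows already ordered by State, while the distinct version keeps the order of first appearance, hence the extra arrange() before comparing.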
I believe there are more general answers than the accepted answer, especially when the other columns are not unique within each group (e.g. you want the max, the min, or the top n items based on one particular column).
The accepted answer works for this question, but suppose, for instance, that you would like to find the county with the max population for each state. (You need to have county and population columns.)
We have the following options:
1. dplyr version
From this link, you need three extra operations (mutate, ungroup and filter) to achieve that:
TDAAtest %>% filter(!is.na(State)) %>% group_by(State) %>% mutate(maxPopulation = max(Population)) %>% ungroup() %>% filter(maxPopulation == Population)
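The question's TDAAtest has no county or population columns, so here is a runnable sketch of the same pattern on a hypothetical data frame named counties. The county names and population figures are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data; names and populations are made-up values
counties <- data.frame(
  State      = c("CA", "CA", "NY", "NY", "TX"),
  County     = c("Alpha", "Beta", "Gamma", "Delta", "Epsilon"),
  Population = c(100, 250, 300, 120, 80)
)

largest <- counties %>%
  group_by(State) %>%
  mutate(maxPopulation = max(Population)) %>%  # per-state maximum on every row
  ungroup() %>%
  filter(maxPopulation == Population)          # keep rows that hit the maximum

largest
```

Note that if two counties in a state tie for the maximum, this pattern keeps both rows.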
2. Function version
This one gives you as much flexibility as you want and you can apply any kind of operation to each group:
maxFUN = function(x) {
  # order population in a descending order
  x = x[with(x, order(-Population)), ]
  x[1, ]
}

TDAAtest %>%
  filter(!is.na(State)) %>%
  group_by(State) %>%
  do(maxFUN(.))
This one is highly recommended for more complex operations. For instance, you can return the top n (topN) counties per state by returning x[1:topN, ] (note the comma, which selects rows rather than columns) as the data frame from maxFUN.
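A possible top-n variant is sketched below. The function name topFUN, its topN argument, and the counties data are all hypothetical. It uses head() instead of x[1:topN, ] as a deliberate swap, so that a group with fewer than topN rows does not produce NA rows.

```r
library(dplyr)

# Hypothetical toy data; names and populations are made-up values
counties <- data.frame(
  State      = c("CA", "CA", "CA", "NY", "NY", "TX"),
  County     = c("Alpha", "Beta", "Gamma", "Delta", "Epsilon", "Zeta"),
  Population = c(100, 250, 175, 300, 120, 80)
)

# Return the topN most populous rows of one group (hypothetical helper)
topFUN <- function(x, topN = 2) {
  x <- x[order(-x$Population), ]  # sort by descending population
  head(x, topN)                   # at most topN rows, no NA padding
}

top2 <- counties %>%
  group_by(State) %>%
  do(topFUN(., topN = 2)) %>%
  ungroup()

top2
```

Here TX has only one county, so it contributes a single row. In recent dplyr versions do() is marked as superseded, and slice_max(Population, n = 2) on the grouped data is one built-in alternative for this particular case.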