
Get dplyr count of distinct in a readable way

I'm new to dplyr and need to count the distinct values within each group. Here's an example table:

data=data.frame(aa=c(1,2,3,4,NA), bb=c('a', 'b', 'a', 'c', 'c')) 

I know I can do things like:

by_bb<-group_by(data, bb, add = TRUE)
summarise(by_bb, mean(aa, na.rm=TRUE), max(aa), sum(!is.na(aa)), length(aa))

But what if I want the count of unique elements?

I can do:

> summarise(by_bb,length(unique(unlist(aa))))
  bb length(unique(unlist(aa)))
1  a                          2
2  b                          1
3  c                          2

And if I want to exclude NAs I can do:

> summarise(by_bb,length(unique(unlist(aa[!is.na(aa)]))))
  bb length(unique(unlist(aa[!is.na(aa)])))
1  a                                      2
2  b                                      1
3  c                                      1

But that's hard for me to read. Is there a better way to do this kind of summarization?

GabyLP asked Nov 03 '14 18:11



1 Answer

How about this option:

data %>%                    # take the data.frame "data"
  filter(!is.na(aa)) %>%    # Using "data", filter out all rows with NAs in aa
  group_by(bb) %>%          # Then, with the filtered data, group it by "bb"
  summarise(Unique_Elements = n_distinct(aa))   # Now summarise with unique elements per group

#Source: local data frame [3 x 2]
#
#  bb Unique_Elements
#1  a               2
#2  b               1
#3  c               1

Use filter to filter out any rows where aa has NAs, then group the data by column bb and then summarise by counting the number of unique elements of column aa by group of bb.
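If you would rather not drop rows before grouping, one alternative is to let n_distinct() ignore the NAs itself through its na.rm argument. This is a minimal sketch using the data from the question; the result name res is only for illustration. For this data it should give the same counts as the filter() version, but a group made up entirely of NAs would be kept with a count of 0 instead of disappearing.

```r
library(dplyr)

# Example data from the question
data <- data.frame(aa = c(1, 2, 3, 4, NA), bb = c("a", "b", "a", "c", "c"))

# Count distinct non-NA values of aa per group of bb,
# without removing any rows beforehand
res <- data %>%
  group_by(bb) %>%
  summarise(Unique_Elements = n_distinct(aa, na.rm = TRUE))

res
```

Which of the two to use mostly depends on whether you want all-NA groups to show up in the result.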

As you can see, I'm making use of the pipe operator %>%, which lets you "pipe" or "chain" commands together when using dplyr. This helps you write easily readable code because it's more natural: you write code from left to right and top to bottom, not deeply nested from the inside out (as in your example code).

Edit:

In the first part of your question, you wrote:

I know I can do things like:

by_bb<-group_by(data, bb, add = TRUE)
summarise(by_bb, mean(aa, na.rm=TRUE), max(aa), sum(!is.na(aa)), length(aa))

Here's another option to do that (applying a number of functions to the same column(s)):

data %>%
  filter(!is.na(aa)) %>%
  group_by(bb) %>%
  summarise_each(funs(mean, max, sum, n_distinct), aa)

#Source: local data frame [3 x 5]
#
#  bb mean max sum n_distinct
#1  a    2   3   4          2
#2  b    2   2   2          1
#3  c    4   4   4          1
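Note that summarise_each() and funs() have since been deprecated in dplyr. Assuming dplyr 1.0.0 or later, roughly the same summary can be sketched with across(); the .names = "{.fn}" argument is there only to keep the short column names from the output above, and res is an illustrative name.

```r
library(dplyr)

# Example data from the question
data <- data.frame(aa = c(1, 2, 3, 4, NA), bb = c("a", "b", "a", "c", "c"))

# Apply several summary functions to column aa within each group of bb
res <- data %>%
  filter(!is.na(aa)) %>%
  group_by(bb) %>%
  summarise(across(aa,
                   list(mean = mean, max = max, sum = sum,
                        n_distinct = n_distinct),
                   .names = "{.fn}"))

res
```

Without .names, the columns would be named after both the column and the function (aa_mean, aa_max and so on), which is handy once you pass more than one column to across().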
talat answered Oct 13 '22 19:10