Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

dplyr n_distinct with condition

Tags:

r

dplyr

Using dplyr to summarise a dataset, I want to call n_distinct to count the number of unique occurrences in a column. However, I also want to do another summarise() for all unique occurrences in a column where a condition in another column is satisfied.

Example dataframe named "a":

A B
1 Y
2 N
3 Y
1 Y

a %>% summarise(count = n_distinct(A))

However I also want to add a count of n_distinct(A) where B == "Y"

The result should be:

count
    3

when you add the condition the result should be:

count
    2

The end result I am trying to achieve is both statements merged into one call that gives me a result like

count_all  count_BisY
        3           2

What is the appropriate way to go about this with dplyr?

like image 699
Ryan Castner Avatar asked Jan 06 '16 15:01

Ryan Castner


2 Answers

This produces the distinct A counts by each value of B using dplyr.

library(dplyr)
a %>%
  group_by(B) %>%
  summarise(count = n_distinct(A))

This produces the result:

Source: local data frame [2 x 2]

       B count
  (fctr) (int)
1      N     1
2      Y     2

To produce the desired output added above using dplyr, you can do the following:

a %>% summarise(count_all = n_distinct(A), count_BisY = length(unique(A[B == 'Y'])))

This produces the result:

  count_all count_BisY
1         3          2
like image 81
Gopala Avatar answered Sep 18 '22 14:09

Gopala


An alternative is to use the uniqueN function from data.table inside dplyr:

library(dplyr)
library(data.table)
a %>% summarise(count_all = n_distinct(A), count_BisY = uniqueN(A[B == 'Y']))

which gives:

  count_all count_BisY
1         3          2

You can also do everything with data.table:

library(data.table)
setDT(a)[, .(count_all = uniqueN(A), count_BisY = uniqueN(A[B == 'Y']))]

which gives the same result.

like image 33
Jaap Avatar answered Sep 19 '22 14:09

Jaap