Using dplyr to summarise a dataset, I want to call n_distinct to count the number of unique occurrences in a column. However, I also want to do another summarise() for all unique occurrences in a column where a condition in another column is satisfied.
Example dataframe named "a":
A B
1 Y
2 N
3 Y
1 Y
a %>% summarise(count = n_distinct(A))
However I also want to add a count of n_distinct(A)
where B == "Y"
The result should be:
count
3
when you add the condition the result should be:
count
2
The end result I am trying to achieve is both statements merged into one call that gives me a result like
count_all count_BisY
3 2
What is the appropriate way to go about this with dplyr?
This produces the distinct A counts by each value of B using dplyr.
library(dplyr)
a %>%
group_by(B) %>%
summarise(count = n_distinct(A))
This produces the result:
Source: local data frame [2 x 2]
B count
(fctr) (int)
1 N 1
2 Y 2
To produce the desired output added above using dplyr, you can do the following:
a %>% summarise(count_all = n_distinct(A), count_BisY = length(unique(A[B == 'Y'])))
This produces the result:
count_all count_BisY
1 3 2
An alternative is to use the uniqueN
function from data.table inside dplyr:
library(dplyr)
library(data.table)
a %>% summarise(count_all = n_distinct(A), count_BisY = uniqueN(A[B == 'Y']))
which gives:
count_all count_BisY
1 3 2
You can also do everything with data.table:
library(data.table)
setDT(a)[, .(count_all = uniqueN(A), count_BisY = uniqueN(A[B == 'Y']))]
which gives the same result.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With