
Why does dplyr::distinct behave like this for grouped data frames

Tags: r, dplyr

My question involves the distinct function from dplyr.

First, set up the data:

library(dplyr)

set.seed(0)

df <- data.frame(
    x = sample(10, 100, replace = TRUE),
    y = sample(10, 100, replace = TRUE)
)

Consider the following two uses of distinct.

df %>%
    group_by(x) %>%
    distinct()

df %>%
    group_by(x) %>%
    distinct(y)

The first produces a different result from the second. As far as I can tell, the first finds "all distinct values of x, returning the first value of y for each", whereas the second finds "for each value of x, all distinct values of y".

Why should this be so when

df %>%
    distinct(x, y)

df %>% distinct()

produce the same result?

EDIT: It looks like this is a known bug already: https://github.com/hadley/dplyr/issues/1110
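Whether the discrepancy shows up depends on the installed dplyr version. The following is a minimal sketch for checking it locally, assuming dplyr is installed; the names n_grouped and n_plain are introduced here purely for illustration.

```r
library(dplyr)

set.seed(0)
df <- data.frame(
    x = sample(10, 100, replace = TRUE),
    y = sample(10, 100, replace = TRUE)
)

# Report the installed version, since the behaviour is version-dependent.
packageVersion("dplyr")

# Row count from distinct() on the grouped data frame.
n_grouped <- df %>%
    group_by(x) %>%
    distinct() %>%
    nrow()

# Row count from distinct() on the ungrouped data frame.
n_plain <- df %>%
    distinct() %>%
    nrow()

# TRUE means grouped and ungrouped distinct() agree in this version;
# FALSE reproduces the discrepancy described in the question.
n_grouped == n_plain
```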

Asked by Alex, Jul 07 '15 04:07


1 Answer

As far as I can tell, the answer is that distinct considers grouping columns when determining distinctness, which to me seems inconsistent with how the rest of dplyr works.

Thus:

df %>%
    group_by(x) %>%
    distinct()

This groups by x and then keeps only the rows that are distinct in x(!), that is, one row per group. This seems to be a bug.

However:

df %>%
    group_by(x) %>%
    distinct(y)

This groups by x and then finds the rows that are distinct in y within each value of x. It is equivalent to either of these cases:

df %>%
    distinct(x, y)

df %>% distinct()

Both find distinct values in x and y.

The take-home message seems to be: don't combine group_by with distinct. Instead, pass the relevant column names as arguments to distinct.
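A minimal sketch of that advice, assuming dplyr is installed and using the same df as in the question. It names the columns explicitly and drops any existing grouping with ungroup() first, so the result should not depend on how a given dplyr version treats grouped data frames in distinct(). The names by_name, from_grouped and base_ref are hypothetical, and base R's unique() is included only as an independent reference.

```r
library(dplyr)

set.seed(0)
df <- data.frame(
    x = sample(10, 100, replace = TRUE),
    y = sample(10, 100, replace = TRUE)
)

# Name the columns explicitly on ungrouped data.
by_name <- df %>%
    distinct(x, y)

# If the data frame arrives already grouped, drop the grouping first.
from_grouped <- df %>%
    group_by(x) %>%
    ungroup() %>%
    distinct(x, y)

# Base R reference: unique() keeps the first occurrence of each row.
base_ref <- unique(df)

# All three should contain the same number of distinct (x, y) pairs.
nrow(by_name) == nrow(from_grouped)
nrow(by_name) == nrow(base_ref)
```

Calling ungroup() on a data frame that is not grouped is harmless, so it can be added defensively whenever the grouping state of the input is unknown.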

Answered by Claus Wilke, Sep 29 '22 16:09