Why does dplyr::distinct behave like this for grouped data frames

Question

My question involves the distinct function from dplyr.

First, set up the data:

library(dplyr)

set.seed(0)

df <- data.frame(
    x = sample(10, 100, replace = TRUE),
    y = sample(10, 100, replace = TRUE)
)

Consider the following two uses of distinct.

df %>%
    group_by(x) %>%
    distinct()

df %>%
    group_by(x) %>%
    distinct(y)

The first produces a different result from the second. As far as I can tell, the first finds "all distinct values of x, and returns the first value of y for each", whereas the second finds "for each value of x, all distinct values of y".

Why should this be so, when the two ungrouped calls

df %>%
    distinct(x, y)

df %>% distinct()

produce the same result?

EDIT: It looks like this is a known bug already: https://github.com/hadley/dplyr/issues/1110

Claus Wilke · Accepted Answer

As far as I can tell, the answer is that distinct() includes the grouping columns when it determines distinctness, which to me seems inconsistent with how the rest of dplyr works.

Thus:

df %>%
    group_by(x) %>%
    distinct()

Group by x, then keep only the rows that are distinct in x(!), so you get one row per value of x. This seems to be a bug.
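To see which behavior an installed dplyr version shows, one option is to compare row counts. This is only a sketch: the names n_grouped, n_x_values and n_all_rows are mine, and base R's unique() serves as an independent reference. If n_grouped matches n_x_values, you are seeing the behavior described above; if it matches n_all_rows, distinct() is looking at all columns.

```r
library(dplyr)

set.seed(0)
df <- data.frame(
    x = sample(10, 100, replace = TRUE),
    y = sample(10, 100, replace = TRUE)
)

# Rows returned by the grouped call with no columns named
n_grouped <- df %>%
    group_by(x) %>%
    distinct() %>%
    nrow()

# Reference counts computed with base R only
n_x_values <- length(unique(df$x))   # one row per distinct value of x
n_all_rows <- nrow(unique(df))       # distinct (x, y) combinations

# Print the three counts side by side for comparison
c(grouped = n_grouped, x_only = n_x_values, all_columns = n_all_rows)
```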

However:

df %>%
    group_by(x) %>%
    distinct(y)

Group by x, then find the values that are distinct in y within each value of x. This is equivalent to either of these cases:

df %>%
    distinct(x, y)

df %>% distinct()

Both find the distinct combinations of x and y.
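As a quick sanity check of that equivalence, the two ungrouped results can be compared against base R's unique(), which does not involve dplyr's grouping at all. This is a sketch; the names by_columns, all_columns and reference are only for illustration.

```r
library(dplyr)

set.seed(0)
df <- data.frame(
    x = sample(10, 100, replace = TRUE),
    y = sample(10, 100, replace = TRUE)
)

by_columns  <- df %>% distinct(x, y)   # columns named explicitly
all_columns <- df %>% distinct()       # no columns named, so all are used
reference   <- unique(df)              # base R, for comparison

# Compare row counts rather than whole objects, since row names may differ
nrow(by_columns) == nrow(all_columns)
nrow(by_columns) == nrow(reference)
```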

The take-home message seems to be: don't combine group_by() with distinct(). Instead, pass the relevant column names as arguments to distinct().
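If the data frame is already grouped by the time it reaches you, one way to follow this advice is to drop the grouping with ungroup() before calling distinct(). A minimal sketch, where grouped_data is a hypothetical stand-in for data that was grouped somewhere upstream:

```r
library(dplyr)

set.seed(0)
df <- data.frame(
    x = sample(10, 100, replace = TRUE),
    y = sample(10, 100, replace = TRUE)
)

# Stand-in for data that was grouped earlier in a pipeline
grouped_data <- df %>% group_by(x)

# Drop the grouping, then name the columns that define distinctness
result <- grouped_data %>%
    ungroup() %>%
    distinct(x, y)

nrow(result)
```

The result is no longer grouped, so later verbs are not affected by the earlier group_by(x).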

Tags: r, dplyr

Alex
