Combine group_by and distinct

Tags: r, dplyr

I have a data.frame with two variables, id.x and id.y. Their combination uniquely identifies each row, but the individual values of each variable are repeated many times in the dataset.

I would like to use dplyr to group_by id.x such that each id.x is matched with a distinct id.y.

Edit: edited the example to highlight the differing number of unique id.x and id.y values.

An example:

  id.x id.y
    a    o
    a    p
    a    q
    c    o
    c    p
    c    q

Would return:

 id.x id.y
    a    o
    c    q

dput for example:

structure(list(id.x = structure(c(1L, 1L, 1L, 2L, 2L, 2L), .Label = c("a", 
"c"), class = "factor"), id.y = structure(c(1L, 2L, 3L, 1L, 2L, 
3L), .Label = c("o", "p", "q"), class = "factor")), .Names = c("id.x", 
"id.y"), row.names = c(NA, -6L), class = "data.frame")

Edit: if my desired result can be accomplished without group_by or distinct, that is fine too. I also use data.table, so a data.table solution would be fine as well.

asked Jun 11 '15 17:06 by bjoseph


1 Answer

Using dplyr

df %>% filter(dense_rank(id.x)==dense_rank(id.y))

which returns

  id.x id.y
1    a    o
2    c    p
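
To see why this works, it may help to look at the ranks themselves. The sketch below rebuilds the example data as plain character columns (an assumption; the question's dput uses factors, which rank the same way here) and adds two helper columns, rank.x and rank.y. Those names are hypothetical and exist only for illustration.

```r
library(dplyr)

# Example data from the question, as character columns rather than factors
df <- data.frame(
  id.x = c("a", "a", "a", "c", "c", "c"),
  id.y = c("o", "p", "q", "o", "p", "q"),
  stringsAsFactors = FALSE
)

# rank.x and rank.y are illustrative helper columns:
# dense_rank() numbers the distinct values 1, 2, 3, ... with no gaps
ranked <- df %>%
  mutate(rank.x = dense_rank(id.x),
         rank.y = dense_rank(id.y))

# Keep only the rows where the two ranks agree, then drop the helpers
result <- ranked %>%
  filter(rank.x == rank.y) %>%
  select(id.x, id.y)

print(ranked)
print(result)
```

Here "a" has rank 1 and "c" has rank 2, while "o", "p" and "q" have ranks 1, 2 and 3, so only the rows (a, o) and (c, p) survive. Note that a row is kept only when the k-th smallest id.x actually occurs together with the k-th smallest id.y; an id.x whose same-rank id.y is missing from the data would be dropped from the result.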

answered Sep 17 '22 00:09 by manotheshark