Select all rows which are duplicates except for one column

Tags: r, tidyverse

I want to find rows in a dataset where the values in all columns, except for one, match. After much messing around trying unsuccessfully to get duplicated() to return all instances of the duplicate rows (not just the first instance), I figured out a way to do it (below).

For example, I want to identify all rows in the Iris dataset that are equal except for Petal.Width.

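library(tidyverse)
# Drop the column to ignore, then keep the rows that repeat an earlier row
x = iris %>% select(-Petal.Width)
dups = x[duplicated(x), ]
# semi_join() keeps every iris row that matches a row of dups on the shared columns
answer = iris %>% semi_join(dups)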

> answer 
   Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
1           5.1         3.5          1.4         0.2    setosa
2           4.9         3.1          1.5         0.1    setosa
3           4.8         3.0          1.4         0.1    setosa
4           5.1         3.5          1.4         0.3    setosa
5           4.9         3.1          1.5         0.2    setosa
6           4.8         3.0          1.4         0.3    setosa
7           5.8         2.7          5.1         1.9 virginica
8           6.7         3.3          5.7         2.1 virginica
9           6.4         2.8          5.6         2.1 virginica
10          6.4         2.8          5.6         2.2 virginica
11          5.8         2.7          5.1         1.9 virginica
12          6.7         3.3          5.7         2.5 virginica

As you can see, that works, but this is one of those times when I'm almost certain that lots of other folks need this functionality, and that I'm unaware of a single function that does it in fewer steps or in a generally tidier way. Any suggestions?

An alternate approach, from at least two other posts, applied to this case would be:

answer = iris[duplicated(iris[-4]) | duplicated(iris[-4], fromLast = TRUE),]

But that also seems like just a different workaround rather than a single function. Both approaches take the same amount of time (0.08 sec on my system). Is there no neater or faster way of doing this?

e.g. something like iris%>%duplicates(all=TRUE,ignore=Petal.Width)

asked Jul 12 '18 10:07 by Paul Raftery


1 Answer

iris[duplicated(iris[,-4]) | duplicated(iris[,-4], fromLast = TRUE),]

Among rows that are duplicates once column 4 is ignored, duplicated(iris[,-4]) flags the later row of each duplicate set (rows 18, 35, 46, 133, 143 and 145), while duplicated(iris[,-4], fromLast = TRUE) flags the earlier row of each set (rows 1, 10, 13, 102, 125 and 129). Combining the two logical vectors with | gives 12 TRUEs, so the subset returns the expected output.
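To see the two halves separately, here is a small sketch of the same idea. The variable names `key`, `later` and `earlier` are mine, not from the answer, and the row numbers in the comments are the ones quoted above.

```r
# Drop column 4 (Petal.Width) to get the columns that must match
key <- iris[, -4]

# Rows that repeat an earlier row (scanning top to bottom)
later <- which(duplicated(key))
# Rows that are repeated by a later row (scanning bottom to top)
earlier <- which(duplicated(key, fromLast = TRUE))

later    # 18 35 46 133 143 145
earlier  # 1 10 13 102 125 129

# The union of both masks selects every member of every duplicate set
answer <- iris[duplicated(key) | duplicated(key, fromLast = TRUE), ]
nrow(answer)  # 12
```

Neither call alone is enough: each one leaves out one end of every duplicate set, which is why the two masks are combined.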

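Or perhaps with dplyr: group on all variables except Petal.Width, count how often each combination occurs, and keep the groups that occur more than once. In dplyr 1.0.0 and later, group_by(across(-Petal.Width)) is the current spelling of the same grouping. Note that the result is still grouped; add ungroup() if that matters downstream.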

library(dplyr)
iris %>% 
  group_by_at(vars(-Petal.Width)) %>% 
  filter(n() > 1)

   Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
          <dbl>       <dbl>        <dbl>       <dbl>    <fctr>
 1          5.1         3.5          1.4         0.2    setosa
 2          4.9         3.1          1.5         0.1    setosa
 3          4.8         3.0          1.4         0.1    setosa
 4          5.1         3.5          1.4         0.3    setosa
 5          4.9         3.1          1.5         0.2    setosa
 6          4.8         3.0          1.4         0.3    setosa
 7          5.8         2.7          5.1         1.9 virginica
 8          6.7         3.3          5.7         2.1 virginica
 9          6.4         2.8          5.6         2.1 virginica
10          6.4         2.8          5.6         2.2 virginica
11          5.8         2.7          5.1         1.9 virginica
12          6.7         3.3          5.7         2.5 virginica
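If a single call along the lines of the `duplicates(all=TRUE, ignore=Petal.Width)` wish in the question is wanted, the base R approach can be wrapped in a small helper. This is only a sketch: `all_duplicates()` and its `ignore` argument are hypothetical names, not functions from base R or dplyr, and `ignore` takes column names as strings.

```r
# Hypothetical helper: return every row that has at least one other row
# matching it on all columns except those named in `ignore`.
all_duplicates <- function(df, ignore = character()) {
  # Columns that must match for two rows to count as duplicates
  key <- df[, setdiff(names(df), ignore), drop = FALSE]
  # Flag later and earlier members of each duplicate set, then combine
  df[duplicated(key) | duplicated(key, fromLast = TRUE), , drop = FALSE]
}

res <- all_duplicates(iris, ignore = "Petal.Width")
nrow(res)  # 12, matching the output above
```

Unlike the dplyr pipeline, this keeps the original row names (1, 10, 13, ...) and returns an ungrouped data frame.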
answered Nov 03 '22 22:11 by Lennyy