Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Remove *all* duplicate rows, unless there's a "similar" row

Tags:

r

data.table

I have the following data.table:

library(data.table)
dt = data.table(c(1, 1, 1, 2, 2, 2, 2, 3, 4),
                c(4, 4, 4, 5, 5, 6, 7, 4, 5))
   V1 V2
1:  1  4
2:  1  4
3:  1  4
4:  2  5
5:  2  5
6:  2  6
7:  2  7
8:  3  4
9:  4  5

I want to study the different values of V2 for a given V1. However, if all values of V2 for a given V1 are the same, that doesn't interest me, so I want to remove such rows.

Looking at the example above, the first three rows are perfectly identical (V1=1, V2=4), so I wish to remove them.

However, the next four rows include two identical rows and others with a different V2. In this case, I want to show the three possible values of V2 given V1 = 2: (2, 5), (2, 6) and (2, 7).

The last two rows have unique V1: that falls under the category of "all rows are perfectly identical", and so should be removed as well.

The best I could think of is shown in this answer:

dt[!duplicated(dt) & !duplicated(dt, fromLast = TRUE), ]
   V1 V2
1:  2  6
2:  2  7
3:  3  4
4:  4  5

Which obviously isn't satisfactory: it removes the (2,5) pair, since it is duplicated, and it keeps the (3,4) and (4,5) pairs since they're unique and therefore not flagged by either duplicated() pass.

The other option would be simply calling

unique(dt)
   V1 V2
1:  1  4
2:  2  5
3:  2  6
4:  2  7
5:  3  4
6:  4  5

But it keeps the (1,4), (3,4), (4,5) pairs I want removed.

In the end, the result I'm looking for is:

   V1 V2
1:  2  5
2:  2  6
3:  2  7

Though any other format is also acceptable, such as:

   V1 V2.1 V2.2 V2.3
1:  2    5    6    7

(which shows the possible values of V2 for each "interesting" V1)

I can't figure out how to differentiate the (1,4) case (all rows are the same) from the (2,5) case (there are some duplicates, but there are other rows with the same V1, so we must remove the duplicate (2,5) but leave one copy).

As for the unique rows, I've written a very ugly call, but it only works if there's only one unique row. If there's two, such as the example above, it fails.

like image 323
Wasabi Avatar asked May 23 '19 21:05

Wasabi


People also ask

When you remove duplicates does it remove the entire row?

Delete rows based on duplicates in one column with Remove Duplicates feature. This method will introduce the Remove Duplicates feature to remove entire rows based on duplicates in one column easily in Excel.


1 Answers

An option would be to group by 'V1', get the index of group that has length of unique elements greater than 1 and then take the unique

unique(dt[dt[, .(i1 = .I[uniqueN(V2) > 1]), V1]$i1])
#   V1 V2
#1:  2  5
#2:  2  6
#3:  2  7

Or as @r2evans mentioned

unique(dt[, .SD[(uniqueN(V2) > 1)], by = "V1"])

NOTE: The OP's dataset is data.table and data.table methods are the natural way of doing it


If we need a tidyverse option, a comparable one to the above data.table option is

library(dplyr)
dt %>%
   group_by(V1) %>% 
   filter(n_distinct(V2) > 1) %>% 
   distinct()
like image 190
akrun Avatar answered Nov 15 '22 00:11

akrun