Remove *all* duplicate rows, unless there's a "similar" row

I have the following data.table:

library(data.table)
dt = data.table(c(1, 1, 1, 2, 2, 2, 2, 3, 4),
                c(4, 4, 4, 5, 5, 6, 7, 4, 5))
   V1 V2
1:  1  4
2:  1  4
3:  1  4
4:  2  5
5:  2  5
6:  2  6
7:  2  7
8:  3  4
9:  4  5

I want to study the different values of V2 for a given V1. However, if all values of V2 for a given V1 are the same, that doesn't interest me, so I want to remove such rows.

Looking at the example above, the first three rows are perfectly identical (V1=1, V2=4), so I wish to remove them.

However, the next four rows include two identical rows and others with a different V2. In this case, I want to show the three possible values of V2 given V1 = 2: (2, 5), (2, 6) and (2, 7).

The last two rows have unique V1: that falls under the category of "all rows are perfectly identical", and so should be removed as well.

The best I could think of is shown in this answer:

dt[!duplicated(dt) & !duplicated(dt, fromLast = TRUE), ]
   V1 V2
1:  2  6
2:  2  7
3:  3  4
4:  4  5

This obviously isn't satisfactory: it removes the (2,5) pair, since it is duplicated, and it keeps the (3,4) and (4,5) pairs, since they're unique and therefore not flagged by either duplicated() pass.

The other option would be simply calling

unique(dt)
   V1 V2
1:  1  4
2:  2  5
3:  2  6
4:  2  7
5:  3  4
6:  4  5

But it keeps the (1,4), (3,4) and (4,5) pairs, which I want removed.

In the end, the result I'm looking for is:

   V1 V2
1:  2  5
2:  2  6
3:  2  7

Though any other format is also acceptable, such as:

   V1 V2.1 V2.2 V2.3
1:  2    5    6    7

(which shows the possible values of V2 for each "interesting" V1)

I can't figure out how to differentiate the (1,4) case (all rows are the same) from the (2,5) case (there are some duplicates, but there are other rows with the same V1, so we must remove the duplicate (2,5) but leave one copy).

As for the unique rows, I've written a very ugly call, but it only works if there's only one unique row. If there are two, as in the example above, it fails.

One option is to group by 'V1', get the row indices (.I) of the groups in which V2 has more than one unique value, subset dt by those indices, and then take the unique rows:

unique(dt[dt[, .(i1 = .I[uniqueN(V2) > 1]), V1]$i1])
#   V1 V2
#1:  2  5
#2:  2  6
#3:  2  7
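
To make the one-liner above easier to follow, here is a sketch that splits it into its three steps. It assumes the dt from the question; the intermediate names idx, kept and res are introduced only for illustration.

```r
library(data.table)

# Same data as in the question; unnamed columns become V1 and V2.
dt <- data.table(c(1, 1, 1, 2, 2, 2, 2, 3, 4),
                 c(4, 4, 4, 5, 5, 6, 7, 4, 5))

# Step 1: per V1 group, keep the global row numbers (.I) only when V2
# has more than one distinct value; other groups contribute no rows.
idx <- dt[, .(i1 = .I[uniqueN(V2) > 1]), by = V1]$i1

# Step 2: subset the original table by those row numbers.
kept <- dt[idx]

# Step 3: drop the repeated rows that remain inside the kept groups.
res <- unique(kept)
print(res)
```

Indexing .I with a single TRUE recycles it and returns every row number of the group, while a single FALSE returns an empty vector, which is why groups with only one distinct V2 drop out.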

Or, as @r2evans mentioned:

unique(dt[, .SD[(uniqueN(V2) > 1)], by = "V1"])

NOTE: The OP's dataset is a data.table, so data.table methods are the natural way of doing it.

If we need a tidyverse option, a comparable one to the above data.table option is

library(dplyr)
dt %>%
   group_by(V1) %>% 
   filter(n_distinct(V2) > 1) %>% 
   distinct()
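
The question also accepts a wide layout (V1, V2.1, V2.2, V2.3). A possible sketch, starting from the long data.table result above, is to number the distinct values within each V1 and reshape with dcast. The helper column name col is an assumption made for this example and is not part of the original data.

```r
library(data.table)

# Same data as in the question; unnamed columns become V1 and V2.
dt <- data.table(c(1, 1, 1, 2, 2, 2, 2, 3, 4),
                 c(4, 4, 4, 5, 5, 6, 7, 4, 5))

# Long result, as produced by the data.table answer above.
res <- unique(dt[dt[, .(i1 = .I[uniqueN(V2) > 1]), V1]$i1])

# Label each distinct V2 within its V1 group: V2.1, V2.2, ...
# ("col" is a hypothetical helper column used only for reshaping.)
res[, col := paste0("V2.", rowid(V1))]

# One row per V1, one column per distinct V2 value.
wide <- dcast(res, V1 ~ col, value.var = "V2")
print(wide)
```

If different V1 groups have different numbers of distinct V2 values, the shorter groups are padded with NA. With ten or more values per group the columns are ordered alphabetically (V2.10 before V2.2), so they may need reordering.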

Tags:

r

data.table

Wasabi

akrun
