I have the following data.table
:
library(data.table)
dt = data.table(c(1, 1, 1, 2, 2, 2, 2, 3, 4),
c(4, 4, 4, 5, 5, 6, 7, 4, 5))
V1 V2
1: 1 4
2: 1 4
3: 1 4
4: 2 5
5: 2 5
6: 2 6
7: 2 7
8: 3 4
9: 4 5
I want to study the different values of V2
for a given V1
. However, if all values of V2
for a given V1
are the same, that doesn't interest me, so I want to remove such rows.
Looking at the example above, the first three rows are perfectly identical (V1=1
, V2=4
), so I wish to remove them.
However, the next four rows include two identical rows and others with a different V2
. In this case, I want to show the three possible values of V2
given V1 = 2
: (2, 5)
, (2, 6)
and (2, 7)
.
The last two rows have unique V1
: that falls under the category of "all rows are perfectly identical", and so should be removed as well.
The best I could think of is shown in this answer:
dt[!duplicated(dt) & !duplicated(dt, fromLast = TRUE), ]
V1 V2
1: 2 6
2: 2 7
3: 3 4
4: 4 5
Which obviously isn't satisfactory: it removes the (2,5)
pair, since it is duplicated, and it keeps the (3,4)
and (4,5)
pairs since they're unique and therefore not flagged by either duplicated()
pass.
The other option would be simply calling
unique(dt)
V1 V2
1: 1 4
2: 2 5
3: 2 6
4: 2 7
5: 3 4
6: 4 5
But it keeps the (1,4)
, (3,4)
, (4,5)
pairs I want removed.
In the end, the result I'm looking for is:
V1 V2
1: 2 5
2: 2 6
3: 2 7
Though any other format is also acceptable, such as:
V1 V2.1 V2.2 V2.3
1: 2 5 6 7
(which shows the possible values of V2
for each "interesting" V1
)
I can't figure out how to differentiate the (1,4)
case (all rows are the same) from the (2,5)
case (there are some duplicates, but there are other rows with the same V1
, so we must remove the duplicate (2,5)
but leave one copy).
As for the unique rows, I've written a very ugly call, but it only works if there's only one unique row. If there's two, such as the example above, it fails.
Delete rows based on duplicates in one column with Remove Duplicates feature. This method will introduce the Remove Duplicates feature to remove entire rows based on duplicates in one column easily in Excel.
An option would be to group by 'V1', get the index of group that has length of unique elements greater than 1 and then take the unique
unique(dt[dt[, .(i1 = .I[uniqueN(V2) > 1]), V1]$i1])
# V1 V2
#1: 2 5
#2: 2 6
#3: 2 7
Or as @r2evans mentioned
unique(dt[, .SD[(uniqueN(V2) > 1)], by = "V1"])
NOTE: The OP's dataset is data.table
and data.table
methods are the natural way of doing it
If we need a tidyverse
option, a comparable one to the above data.table
option is
library(dplyr)
dt %>%
group_by(V1) %>%
filter(n_distinct(V2) > 1) %>%
distinct()
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With