data.table cutoff row after duplicate

Question

Let's say I have the following dataset:

library(data.table)
dt <- data.table(x = c(1, 2, 4, 5, 2, 3, 4))

> dt
   x
1: 1
2: 2
3: 4
4: 5
5: 2
6: 3
7: 4

I would like to cut off the table after the 4th row, because the first duplicate (the number 2) occurs in the row after it.

Expected Output:

   x
1: 1
2: 2
3: 4
4: 5

Needless to say, I am not looking for dt[1:4, ,][], as the real dataset is more complicated.

I experimented with shift() and .I, but it didn't work. One idea was: dt[x %in% dt$x[1:(.I - 1)], .SD, ][].

akrun · Accepted Answer

Perhaps we can use duplicated():

dt[seq_len(which(duplicated(x))[1]-1)]
#   x
#1: 1
#2: 2
#3: 4
#4: 5

Or, as @lmo suggested:

dt[seq_len(which.max(duplicated(dt))-1)]

talat · Answer

Here's another option:

dt[seq_len(anyDuplicated(x)-1L)]

From the help files:

anyDuplicated(): an integer or real vector of length one with value the 1-based index of the first duplicate if any, otherwise 0.
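
As a quick check of that return value, here is a small sketch in base R using the column from the question. The second vector is made up for illustration and has no duplicates.

```r
# Column values from the question
x <- c(1, 2, 4, 5, 2, 3, 4)

# The second 2 sits at position 5, so the first duplicate has index 5
first_dup <- anyDuplicated(x)

# Made-up vector without duplicates: anyDuplicated() returns 0
no_dup <- anyDuplicated(c(1, 2, 3))
```

Subtracting 1 from first_dup gives 4, the number of rows to keep. With no_dup the same subtraction gives -1, which seq_len() rejects.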

But note that if you don't have any duplicate in the column, you may run into problems with this approach (and the other approach currently posted).

To take care of that, you can modify it to:

dt[if((ix <- anyDuplicated(x)-1L) > 0) seq_len(ix) else seq_len(.N)]

This returns all rows if no duplicate is found; otherwise it returns only the rows before the first duplicate.
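
If the guard is needed in several places, the same logic can be wrapped in a small function. This is only a sketch: cut_before_first_dup is a hypothetical helper name, not part of data.table, and dt2 is made-up data without duplicates.

```r
library(data.table)

# Hypothetical helper (not part of data.table): returns the row indices
# before the first duplicated value of v, or all indices if v has no duplicate.
cut_before_first_dup <- function(v) {
  ix <- anyDuplicated(v) - 1L
  if (ix > 0L) seq_len(ix) else seq_along(v)
}

# Data from the question: the first duplicate is in row 5
dt <- data.table(x = c(1, 2, 4, 5, 2, 3, 4))
res_dup <- dt[cut_before_first_dup(x)]

# Made-up data without duplicates: every row is kept
dt2 <- data.table(x = c(1, 2, 3))
res_nodup <- dt2[cut_before_first_dup(x)]
```

Here res_dup holds the first four rows of dt and res_nodup holds all three rows of dt2.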
