
data.table cutoff row after duplicate

Tags: r, data.table

Let's say I have the following dataset:

library(data.table)
dt <- data.table(x = c(1, 2, 4, 5, 2, 3, 4))

> dt
   x
1: 1
2: 2
3: 4
4: 5
5: 2
6: 3
7: 4

I would like to cut off the table after the 4th row, because the first duplicate (the value 2) occurs in the 5th row.

Expected Output:

   x
1: 1
2: 2
3: 4
4: 5

Needless to say, I am not looking for dt[1:4, ,][] because the real dataset is more complicated.

I experimented with shift() and .I, but it didn't work. One idea was: dt[x %in% dt$x[1:(.I - 1)], .SD, ][].

Tlatwork asked Dec 19 '22 04:12


2 Answers

Perhaps we can use duplicated():

dt[seq_len(which(duplicated(x))[1]-1)]
#   x
#1: 1
#2: 2
#3: 4
#4: 5

Or, as @lmo suggested:

dt[seq_len(which.max(duplicated(dt))-1)]
akrun answered Jan 13 '23 10:01


Here's another option:

dt[seq_len(anyDuplicated(x)-1L)]

From the help files:

anyDuplicated(): an integer or real vector of length one with value the 1-based index of the first duplicate if any, otherwise 0.
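To make the return value concrete, here is a minimal sketch on the vector from the question and on a duplicate-free vector (the second vector is made up for illustration):

```r
x <- c(1, 2, 4, 5, 2, 3, 4)

# The second 2 sits at position 5, so that index is returned.
first_dup <- anyDuplicated(x)
first_dup

# No value repeats here, so the result is 0.
no_dup <- anyDuplicated(c(1, 2, 3))
no_dup
```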

But note that if you don't have any duplicate in the column, you may run into problems with this approach (and the other approach currently posted).
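As a rough illustration of those problems, the sketch below runs the posted one-liners on a table without duplicates. The table dt_unique and the result names are made up for this example; the errors are caught with tryCatch() so the script keeps running.

```r
library(data.table)

# Hypothetical table in which no value repeats.
dt_unique <- data.table(x = c(1, 2, 3))

# anyDuplicated() returns 0, so seq_len() receives -1 and throws an error.
res_any <- tryCatch(dt_unique[seq_len(anyDuplicated(x) - 1L)],
                    error = function(e) "error")

# which() returns integer(0), so [1] yields NA and seq_len(NA) throws an error.
res_which <- tryCatch(dt_unique[seq_len(which(duplicated(x))[1] - 1)],
                      error = function(e) "error")

# which.max() of an all-FALSE vector is 1, so this silently returns zero rows.
res_max <- dt_unique[seq_len(which.max(duplicated(dt_unique)) - 1)]
nrow(res_max)
```

So two variants fail with an error, while the which.max() variant quietly drops every row, which may be harder to notice.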

To take care of that, you can modify it to:

dt[if((ix <- anyDuplicated(x)-1L) > 0) seq_len(ix) else seq_len(.N)]

This returns all rows if no duplicate is found; otherwise it returns only the rows before the first duplicate.
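If the same logic is needed in several places, it could be wrapped in a small function. The sketch below is one possible way to do that; cut_before_first_dup() is a hypothetical helper name, not part of data.table, and it uses head() instead of indexing inside dt[...].

```r
library(data.table)

# Hypothetical helper (not part of data.table): keep the rows that precede
# the first duplicated value in column `col`; return all rows if none repeats.
cut_before_first_dup <- function(dt, col) {
  first_dup <- anyDuplicated(dt[[col]])  # 0 when no value repeats
  if (first_dup == 0L) return(dt)
  head(dt, first_dup - 1L)
}

dt <- data.table(x = c(1, 2, 4, 5, 2, 3, 4))
res_dup <- cut_before_first_dup(dt, "x")                            # rows 1 to 4
res_nodup <- cut_before_first_dup(data.table(x = c(1, 2, 3)), "x")  # all 3 rows
```

Passing the column name as a string keeps the helper independent of the column being called x.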

talat answered Jan 13 '23 08:01