Index consecutive duplicates in vector

Tags: r, vector

What is the best way to get the indices of all elements that are repeated a given number of times in a row? I want to identify the elements that are duplicated more than 2 times. rle() and rleid() both hint at the values I need, but neither gives me the indices directly.

I came up with this code:

t1 <- c(1, 10, 10, 10, 14, 37, 3, 14, 8, 8, 8, 8, 39, 12)

t2 <- dplyr::lag(t1, 1)  # dplyr's lag(); stats::lag() does not shift a plain vector
t2[is.na(t2)] <- 0
t3 <- ifelse(t1 - t2 == 0, 1, 0)
t4 <- rep(0, length(t3))
for (i in 2:length(t3)) t4[i] <- ifelse(t3[i] > 0, t3[i - 1] + t3[i], 0)

which(t4 > 1)

returns:

[1]  4 11 12 

and those are the values I need.

Are there any R-functions that are more appropriate?

Ben

Ben Engbers asked Jun 24 '19 13:06


2 Answers

One option uses data.table. There is no real reason to prefer it over lag/shift when n = 2, but for larger n it saves you from creating a large number of new lagged vectors.

library(data.table)

which(rowid(rleid(t1)) > 2)
# [1]  4 11 12

Explanation:

rleid produces a unique ID for each "run" of equal values, and rowid, applied to those IDs, marks how many elements "into" its run each element is. What you want is the elements more than 2 "into" a run.

data.table(
  t1,
  rleid(t1),
  rowid(rleid(t1)))

#     t1 V2 V3
#  1:  1  1  1
#  2: 10  2  1
#  3: 10  2  2
#  4: 10  2  3
#  5: 14  3  1
#  6: 37  4  1
#  7:  3  5  1
#  8: 14  6  1
#  9:  8  7  1
# 10:  8  7  2
# 11:  8  7  3
# 12:  8  7  4
# 13: 39  8  1
# 14: 12  9  1
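
For readers without data.table, the same idea can be sketched in base R. This variant is not part of the original answer: rle() supplies the run lengths, and sequence() turns them into each element's position within its run, which plays the role of rowid(rleid(t1)). The names run_lengths and pos_in_run are only illustrative.

```r
t1 <- c(1, 10, 10, 10, 14, 37, 3, 14, 8, 8, 8, 8, 39, 12)

# Lengths of the consecutive runs: 1 3 1 1 1 1 4 1 1
run_lengths <- rle(t1)$lengths

# Position of each element within its run (1, 2, 3, ... restarting at each run)
pos_in_run <- sequence(run_lengths)

which(pos_in_run > 2)
# [1]  4 11 12
```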

Edit: If no value occurs in more than one run (counting length-1 "runs"), or if you don't care whether the duplicates are next to each other, you can just use which(rowid(t1) > 2) instead. It gives the same result for the example in this question because 14, the only value that occurs in two separate runs, appears just twice in total. (This was noted by Frank in the comments.)

Hopefully this example clarifies the difference:

a <- c(1, 1, 1, 2, 2, 1)
which(rowid(a) > 2)
# [1] 3 6
which(rowid(rleid(a)) > 2)
# [1] 3

IceCreamToucan answered Nov 13 '22 00:11


You can use dplyr::lag or data.table::shift (note that shift lags by default, so shift(t1, 1) is equivalent to shift(t1, 1, type = "lag")):

library(dplyr)       # for lag()
library(data.table)  # for shift()

which(t1 == lag(t1, 1) & lag(t1, 1) == lag(t1, 2))
# [1]  4 11 12
# Or
which(t1 == shift(t1, 1) & shift(t1, 1) == shift(t1, 2))
# [1]  4 11 12

If you need it to scale for several duplicates you can do the following (thanks for the tip @IceCreamToucan):

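n <- 2  # number of lags to compare against
df1 <- sapply(0:n, function(x) shift(t1, x))  # column k + 1 holds t1 lagged by k
which(rowMeans(df1 == df1[, 1]) == 1)         # rows where every lag equals t1
# [1]  4 11 12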
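
If you would rather avoid both packages, the lag comparison above can be sketched in base R. The helper repeated_idx is hypothetical (it does not come from either answer) and assumes 1 <= n < length(x) and no NA values in x.

```r
# Hypothetical helper: indices i where x[i] equals each of the
# n elements immediately before it.
# Assumes 1 <= n < length(x) and that x contains no NA values.
repeated_idx <- function(x, n) {
  same_as_lag <- lapply(seq_len(n), function(k) {
    # The first k positions have no k-th predecessor, so pad with FALSE
    c(rep(FALSE, k), tail(x, -k) == head(x, -k))
  })
  # Keep positions that match at every lag from 1 to n
  which(Reduce(`&`, same_as_lag))
}

t1 <- c(1, 10, 10, 10, 14, 37, 3, 14, 8, 8, 8, 8, 39, 12)
repeated_idx(t1, 2)
# [1]  4 11 12
repeated_idx(t1, 3)
# [1] 12
```

Like the sapply version, this builds n logical vectors, so for large n the run-length approach in the other answer is likely to be cheaper.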

Andrew answered Nov 13 '22 02:11