Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Find fastest way to get all intervals between identical elements in a vector

Tags:

r

Say I have a vector of characters with 8 letters, with each one occurring twice:

x <- rep(LETTERS[1:8],2)
set.seed(1)
y <- sample(x)
y

# [1] "E" "F" "A" "D" "C" "B" "C" "G" "F" "A" "B" "G" "E" "H" "D" "H"

I want to find the intervals between each pair of letters. Here, interval refers to the number of letters between the two identical letters. I can do this manually like this:

abs(diff(which(y=="A")))-1  #6
abs(diff(which(y=="D")))-1  #10
abs(diff(which(y=="H")))-1  #1

I wrote a for loop to do this...

res<-NULL
for(i in 1:8){  res[[i]] <- abs(diff(which(y==LETTERS[i])))-1 }

names(res)<-LETTERS[1:8]
res

# A  B  C  D  E  F  G  H 
# 6  4  1 10 11  6  3  1 

However, I want to use this approach in a randomization process with very long vectors. Speed is essential for this - I wonder if anyone has good ideas for making the fastest approach to this problem as possible.

like image 718
jalapic Avatar asked Apr 26 '15 19:04

jalapic


2 Answers

You'll want to set up an index vector and then do a diff(index vector)-by-group operation.


Here's how it looks in the data.table package:

require(data.table)
yDT <- data.table(y)

yDT[,diff(.I)-1,keyby=y]
#    y V1
# 1: A  6
# 2: B  4
# 3: C  1
# 4: D 10
# 5: E 11
# 6: F  6
# 7: G  3
# 8: H  1

The index vector here is the special (built in) variable .I, that stores the row number.

keyby=y groups by y and sorts the result alphabetically; alternately with by=y, we would see the results sorted by first appearance of the group. (Thanks, @Arun, for pointing this out.)


The analogous solution in base R looks like

tapply(1:length(y),y,diff)-1
# A  B  C  D  E  F  G  H 
# 6  4  1 10 11  6  3  1 
like image 173
Frank Avatar answered Nov 03 '22 01:11

Frank


Using data.table::chmatch is considerably faster.

library(data.table)   
f <- function(x){
  ux <- unique(x)
  out <- length(x) - chmatch(ux, rev(x)) - chmatch(ux, x)
  setNames(out, ux)
}

f(y)
# E  F  A  D  C  B  G  H 
#11  6  6 10  1  4  3  1 

It is about 2x faster than cmpalex.

set.seed(007); xx = sample(rep(make.unique(rep_len(LETTERS, 1e3)), each = 2))
microbenchmark::microbenchmark(cmpalex(xx), f(xx), unit="relative")
#Unit: relative
#        expr      min       lq    mean   median       uq      max neval
# cmpalex(xx) 2.402806 2.366553 2.33802 2.359145 2.324677 2.232852   100
#       f(xx) 1.000000 1.000000 1.00000 1.000000 1.000000 1.000000   100

R version 3.2.0 (2015-04-16)
Running under: Windows 8 x64 (build 9200)   

other attached packages:
[1] data.table_1.9.5
like image 25
Khashaa Avatar answered Nov 03 '22 01:11

Khashaa