Say I have a vector of characters with 8 letters, with each one occurring twice:
x <- rep(LETTERS[1:8],2)
set.seed(1)
y <- sample(x)
y
# [1] "E" "F" "A" "D" "C" "B" "C" "G" "F" "A" "B" "G" "E" "H" "D" "H"
I want to find the intervals between each pair of letters. Here, interval refers to the number of letters between the two identical letters. I can do this manually like this:
abs(diff(which(y=="A")))-1 #6
abs(diff(which(y=="D")))-1 #10
abs(diff(which(y=="H")))-1 #1
I wrote a for
loop to do this...
res<-NULL
for(i in 1:8){ res[[i]] <- abs(diff(which(y==LETTERS[i])))-1 }
names(res)<-LETTERS[1:8]
res
# A B C D E F G H
# 6 4 1 10 11 6 3 1
However, I want to use this approach in a randomization process with very long vectors. Speed is essential for this - I wonder if anyone has good ideas for making the fastest approach to this problem as possible.
You'll want to set up an index vector and then do a diff(index vector)-by-group operation.
Here's how it looks in the data.table
package:
require(data.table)
yDT <- data.table(y)
yDT[,diff(.I)-1,keyby=y]
# y V1
# 1: A 6
# 2: B 4
# 3: C 1
# 4: D 10
# 5: E 11
# 6: F 6
# 7: G 3
# 8: H 1
The index vector here is the special (built in) variable .I
, that stores the row number.
keyby=y
groups by y
and sorts the result alphabetically; alternately with by=y
, we would see the results sorted by first appearance of the group. (Thanks, @Arun, for pointing this out.)
The analogous solution in base R looks like
tapply(1:length(y),y,diff)-1
# A B C D E F G H
# 6 4 1 10 11 6 3 1
Using data.table::chmatch
is considerably faster.
library(data.table)
f <- function(x){
ux <- unique(x)
out <- length(x) - chmatch(ux, rev(x)) - chmatch(ux, x)
setNames(out, ux)
}
f(y)
# E F A D C B G H
#11 6 6 10 1 4 3 1
It is about 2x faster than cmpalex
.
set.seed(007); xx = sample(rep(make.unique(rep_len(LETTERS, 1e3)), each = 2))
microbenchmark::microbenchmark(cmpalex(xx), f(xx), unit="relative")
#Unit: relative
# expr min lq mean median uq max neval
# cmpalex(xx) 2.402806 2.366553 2.33802 2.359145 2.324677 2.232852 100
# f(xx) 1.000000 1.000000 1.00000 1.000000 1.000000 1.000000 100
R version 3.2.0 (2015-04-16)
Running under: Windows 8 x64 (build 9200)
other attached packages:
[1] data.table_1.9.5
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With