I have two very long vectors:
a <- sample(1e+08L, size = 1e+09L, replace = TRUE)
b <- sample(1e+08L, size = 1e+09L, replace = TRUE)
I want to generate an integer vector r of length length(a) such that r[i] is the index of a[i] in b.
I tried pmatch(a, b) but it is very slow. Is there a more efficient way?
Desired output for a small example:
a <- c(1, 3, 5, 7, 8)
b <- c(3, 1, 7, 8, 5)
f(a, b)
## [1] 2 1 5 3 4
Your question mentions pmatch, which performs partial matching of character vectors, but it seems like you want match, which performs exact matching of integer and other vectors.
match is faster than pmatch, and fastmatch::fmatch is faster still:
match(a, b)
fastmatch::fmatch(a, b)
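As a quick check against the small example from the question, here is a minimal base-R sketch. The name f is just the placeholder used in the question's desired output; wrapping match() in it is an illustration, not a required step.

```r
a <- c(1, 3, 5, 7, 8)
b <- c(3, 1, 7, 8, 5)

# f is the placeholder name from the question; here it simply wraps match(),
# which returns, for each element of x, its first position in table
f <- function(x, table) match(x, table)

r <- f(a, b)
print(r)
## [1] 2 1 5 3 4

# values that do not occur in b come back as NA
print(f(c(1, 2), b))
## [1]  2 NA
```

Note the argument order: the vector being looked up comes first and the vector being searched comes second.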
Adding to the benchmarks:
library(fastmatch)
set.seed(1)
a <- sample(1e5, 1e5)
b <- sample(1e5, 1e4)
microbenchmark::microbenchmark(
  match = match(b, a),
  fastmatch = fmatch(b, a),
  check = "identical",
  times = 100L)
Unit: microseconds
      expr      min       lq      mean    median        uq       max neval
     match 9439.020 9500.602 9659.7537 9519.6260 9555.0090 12394.546   100
 fastmatch  367.606  376.134  398.4347  382.0175  399.1145   614.467   100
Benchmark:
library(fastmatch) #fmatch
library(data.table) #merge
library(collections) #hash
Rcpp::cppFunction('IntegerVector matchC(NumericVector x, NumericVector table) {
  IntegerVector out(x.size(), NA_INTEGER);
  for(int i = 0; i < x.size(); i++) {
    for(int j = 0; j < table.size(); j++) {
      if(x[i] == table[j]) {
        out[i] = j + 1;
        break;
      }
    }
  }
  return out;
}')
Rcpp::cppFunction('IntegerVector matchC2(const std::vector<int>& x, const std::vector<int>& table) {
  IntegerVector out(x.size(), NA_INTEGER);
  std::unordered_map<int, int> lut;
  lut.max_load_factor(table.size()/(double)*max_element(table.begin(), table.end()));
  lut.reserve(table.size());
  for(int i = 0; i < table.size(); i++) lut[table[i]] = i+1;
  for(int i = 0; i < x.size(); i++) {
    auto search = lut.find(x[i]);
    if(search != lut.end()) out[i] = search->second;
  }
  return out;
}')
set.seed(1); a <- sample(1e5, 1e5); b <- sample(1e5, 1e4)
bench::mark(
  match = { match(a, b) },
  fmatch = { fmatch(a, b) },
  zx8754.merge = {
    merge(data.table(x = a, rnA = seq_along(a), key = "x"),
          data.table(x = b, rnB = seq_along(b), key = "x"),
          all.x = TRUE)[order(rnA), rnB] },
  sotos.Rcpp = { matchC(a, b) },
  GKi.Rcpp = { matchC2(a, b) },
  user2974951.hash = {
    h = dict(seq_along(b), b)
    sapply(a, h$get, default = NA) },
  "jblood94.[" = `[<-`(NA_integer_, b, seq_along(b))[a]
)
Result
  expression            min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
  <bch:expr>       <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
1 match              1.86ms   1.99ms   479.      951.3KB     7.98   240     4
2 fmatch             1.03ms   1.12ms   896.     393.34KB     6.00   448     3
3 zx8754.merge       4.71ms   5.02ms   181.       8.05MB    31.6     92    16
4 sotos.Rcpp          2.38s    2.38s     0.420    1.22MB     0        1     0
5 GKi.Rcpp         891.93µs  945.6µs  1018.     393.16KB     7.98   510     4
6 user2974951.hash 127.58ms 133.66ms     6.85     3.44MB    30.8      4    18
7 jblood94.[       227.28µs  244.8µs  3193.     800.85KB    48.0   1597    24
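The fastest entry, jblood94.[, uses the values of b directly as positions in a lookup vector instead of hashing them. A minimal sketch of that idea follows, written as a function. The name match_lut is hypothetical, and the reversed assignment is an addition that is not in the benchmarked one-liner; it makes the first occurrence in table win, as match() does. The sketch assumes both inputs hold positive whole numbers and that a vector of length max(table) fits in memory (roughly 400 MB of integers for values up to 1e8).

```r
# Hypothetical helper: direct-address lookup table.
# Assumes x and table contain positive whole numbers only.
match_lut <- function(x, table) {
  # one slot per possible value; NA marks "not present in table"
  lut <- rep(NA_integer_, max(table))
  # assign positions in reverse order: with duplicated indices the last
  # assignment wins, so the first occurrence in table ends up stored
  idx <- rev(seq_along(table))
  lut[table[idx]] <- idx
  # values of x beyond length(lut) index out of range and yield NA
  lut[x]
}

a <- c(1, 3, 5, 7, 8)
b <- c(3, 1, 7, 8, 5)
print(match_lut(a, b))
## [1] 2 1 5 3 4
```

The trade-off is memory: the lookup vector grows with the largest value in table rather than with the number of elements, so this only pays off when the values are dense, as they are in the benchmark data.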