I have two very long vectors:
a <- sample(1e+08L, size = 1e+09L, replace = TRUE)
b <- sample(1e+08L, size = 1e+09L, replace = TRUE)
I want to generate an integer vector r
of length length(a)
such that r[i]
is the index of a[i]
in b
.
I tried pmatch(a, b)
but it is very slow. Is there a more efficient way?
Desired output for a small example:
a <- c(1, 3, 5, 7, 8)
b <- c(3, 1, 7, 8, 5)
f(a, b)
## [1] 2 1 5 3 4
Your question mentions pmatch
, which performs partial matching of character vectors, but it seems like you want match
, which performs exact matching of integer and other vectors.
match
is faster, but even faster than match
is fastmatch::fmatch
:
match(b, a)
fastmatch::fmatch(b, a)
Adding to the benchmarks:
library(fastmatch)
set.seed(1)
a <- sample(1e5, 1e5)
b <- sample(1e5, 1e4)
microbenchmark::microbenchmark(
match = match(b, a),
fastmatch = fmatch(b, a),
check = "identical",
times = 100L)
Unit: microseconds
expr min lq mean median uq max neval
match 9439.020 9500.602 9659.7537 9519.6260 9555.0090 12394.546 100
fastmatch 367.606 376.134 398.4347 382.0175 399.1145 614.467 100
Benchmark:
library(fastmatch) #fmatch
library(data.table) #merge
library(collections) #hash
Rcpp::cppFunction('IntegerVector matchC(NumericVector x, NumericVector table) {
IntegerVector out(x.size(), NA_INTEGER);
for(int i = 0; i < x.size(); i++) {
for(int j = 0; j < table.size(); j++) {
if(x[i] == table[j]) {
out[i] = j + 1;
break;
}
}
}
return out;
}')
Rcpp::cppFunction('IntegerVector matchC2(const std::vector<int>& x, const std::vector<int>& table) {
IntegerVector out(x.size(), NA_INTEGER);
std::unordered_map<int, int> lut;
lut.max_load_factor(table.size()/(double)*max_element(table.begin(), table.end()));
lut.reserve(table.size());
for(int i = 0; i < table.size(); i++) lut[table[i]] = i+1;
for(int i = 0; i < x.size(); i++) {
auto search = lut.find(x[i]);
if(search != lut.end()) out[i] = search->second;
}
return out;
}')
set.seed(1); a <- sample(1e5, 1e5); b <- sample(1e5, 1e4)
bench::mark(
match = { match(a, b) },
fmatch = { fmatch(a, b) },
zx8754.merge = {
merge(data.table(x = a, rnA = seq_along(a), key = "x"),
data.table(x = b, rnB = seq_along(b), key = "x"),
all.x = TRUE)[order(rnA), rnB] },
sotos.Rcpp = { matchC(a, b) },
GKi.Rcpp = { matchC2(a, b) },
user2974951.hash = {
h = dict(seq_along(b), b)
sapply(a, h$get, default = NA)},
"jblood94.[" = `[<-`(NA_integer_, b, seq_along(b))[a]
)
Result
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl>
1 match 1.86ms 1.99ms 479. 951.3KB 7.98 240 4
2 fmatch 1.03ms 1.12ms 896. 393.34KB 6.00 448 3
3 zx8754.merge 4.71ms 5.02ms 181. 8.05MB 31.6 92 16
4 sotos.Rcpp 2.38s 2.38s 0.420 1.22MB 0 1 0
5 GKi.Rcpp 891.93µs 945.6µs 1018. 393.16KB 7.98 510 4
6 user2974951.hash 127.58ms 133.66ms 6.85 3.44MB 30.8 4 18
7 jblood94.[ 227.28µs 244.8µs 3193. 800.85KB 48.0 1597 24
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With