This is hard for me to explain, so I will just give an example instead. I have two vectors below (a & b).
a <- c("cat","dog","banana","yogurt","dog")
b <- c("salamander","worm","dog","banana","cat","yellow","blue")
What I would like is the following results:
[1] 0 0 2 1 1 0 0
where each element of the result is the number of times each element of b appears in the vector a.
do.call("c",lapply(b,function(x){sum(x == a)}))
This gives me what I want, but I need a vectorized/faster version of this because I am working with >20,000 records. Any help appreciated!
You could use outer
with colSums
:
colSums(outer(a, b, `==`))
[1] 0 0 2 1 1 0 0
You can do:
res <- table(factor(b, levels=b)[match(a, b, nomatch=0)])
salamander worm dog banana cat yellow blue
0 0 2 1 1 0 0
If you want a vanilla vector, there's as.vector(res)
.
Comments
==
as in some other answers. The steps are pretty similar to @GKi's double merge: find where the vectors match, then map back to b.Benchmarks
Required packages: data.table, purrr, microbenchmark
Various options
library(data.table)
# NelsonGon's answer
purrem <- function() purrr::map_dbl(b, ~sum(.x==a))
# Andrew's answer
vappem <- function() vapply(b, function(x) sum(x == a), FUN.VALUE = integer(1))
# Andrew's answer
collem <- function() colSums(outer(a, b, `==`))
# arg0naut91's answer
lappem <- function() unlist(lapply(b, function(x) sum(x == a)))
# this answer
matchem <- function() table(factor(b, levels=b)[match(a, b, nomatch=0)])
# this answer + data.table
matchem2<- function()
setDT(list(b))[, n := 0L][setDT(list(a))[, .N, by=V1], on=.(V1), n := N]$n
# @GKi's answer
mergem <- function() merge(b, table(merge(a, b, by=1)), by=1, all.x=T)[,2]
Example input and benchmarking code
nv = 1e4 # values that can appear in a
nb = 1e3 # values to look up, nb <= na
na = 1e5 # length of a
set.seed(1)
a <- sample(nv, na, replace=TRUE)
b <- seq_len(nb)
microbenchmark::microbenchmark(times = 10,
pur_res <- purrem(),
vap_res <- vappem(),
col_res <- collem(),
lap_res <- lappem(),
mat_res <- matchem(),
mat_res2<- matchem2(),
mer_res <- mergem()
)
# make sure results match
# left as an exercise for the cautious user
identical(as.vector(mat_res), lap_res) # ok
identical(as.integer(col_res), lap_res) # ok
# etc
Results
Unit: milliseconds
expr min lq mean median uq max neval
pur_res <- purrem() 373.488498 389.331825 479.039835 430.363183 500.948370 858.77997 10
vap_res <- vappem() 367.247322 397.516902 472.635368 505.782597 532.951841 570.68548 10
col_res <- collem() 1353.356494 1481.029982 1507.536324 1515.966781 1552.886597 1650.93967 10
lap_res <- lappem() 352.197701 394.562073 469.988534 507.935397 525.426475 559.56388 10
mat_res <- matchem() 3.032507 3.230309 5.101941 3.371101 3.874484 15.31595 10
mat_res2 <- matchem2() 7.591947 11.666453 12.809046 12.266796 13.676658 22.04095 10
mer_res <- mergem() 23.448314 23.712974 27.730525 24.547323 24.716967 46.92548 10
If it takes under a second, fits in memory and is run once, choosing among these options probably isn't too important. The ranking among the not-slow options probably depends on the parameters of the OP's actual problem (which nv, na, nb can hopefully be adjusted to approximate here).
Feel free to edit in more options and rerun, copying your results over mine here. For example, I couldn't get @NelsonGon's stringi approach to work with these parameters, but maybe someone else has more patience or a more powerful computer. I'd also be curious to see memory usage, but haven't learned the packages that support measuring it yet.
If there is some nv/na/nb configuration where one answer performs particularly well, editing that answer with a similar benchmark highlighting that case is an option.
Just FYI:
bench::mark(
pur_res <- purrem(),
vap_res <- vappem(),
col_res <- collem(),
lap_res <- lappem(),
mat_res <- matchem(),
mat_res2<- matchem2(),
mer_res <- mergem(),
stringi <- sapply(b, function(x) sum(stringi::stri_count(x, regex=a))),
check=FALSE
)
# A tibble: 8 x 14
expression min mean median max `itr/sec` mem_alloc n_gc n_itr total_time result memory time gc
<chr> <bch:tm> <bch:tm> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <bch:tm> <list> <list> <list> <list>
1 pur_res <- purrem() 421.14ms 424.65ms 424.65ms 428.15ms 2.35 382.21MB 0 2 849.29ms <dbl [1,0~ <Rprofmem [2,1~ <bch:~ <tibble [2 ~
2 vap_res <- vappem() 367.88ms 370.61ms 370.61ms 373.34ms 2.70 381.52MB 0 2 741.23ms <int [1,0~ <Rprofmem [1,0~ <bch:~ <tibble [2 ~
3 col_res <- collem() 1.64s 1.64s 1.64s 1.64s 0.608 1.12GB 2 1 1.64s <dbl [1,0~ <Rprofmem [32 ~ <bch:~ <tibble [1 ~
4 lap_res <- lappem() 411.25ms 506.67ms 506.67ms 602.1ms 1.97 381.53MB 3 2 1.01s <int [1,0~ <Rprofmem [1,0~ <bch:~ <tibble [2 ~
5 mat_res <- matchem() 3.11ms 3.48ms 3.44ms 5.79ms 287. 1.4MB 0 144 501.66ms <S3: tabl~ <Rprofmem [90 ~ <bch:~ <tibble [14~
6 mat_res2 <- matchem2() 5.22ms 6.26ms 5.96ms 27.7ms 160. 4.83MB 1 80 501.18ms <int [1,0~ <Rprofmem [435~ <bch:~ <tibble [80~
7 mer_res <- mergem() 19.88ms 22.75ms 22.02ms 33.6ms 44.0 6.59MB 1 23 523.3ms <int [1,0~ <Rprofmem [410~ <bch:~ <tibble [23~
8 stringi <- sapply(b, function(x) sum(string~ 6.57m 6.57m 6.57m 6.57m 0.00254 1.12GB 1 1 6.57m <int [1,0~ <Rprofmem [2,3~ <bch:~ <tibble [1 ~
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With