Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Creating a vector in R of counts for number of times each element appears in another vector

Tags:

r

This is hard for me to explain, so I will just give an example instead. I have two vectors below (a & b).

a <- c("cat","dog","banana","yogurt","dog")
b <- c("salamander","worm","dog","banana","cat","yellow","blue")

What I would like is the following results:

[1] 0 0 2 1 1 0 0 

where each element of the result is the number of times each element of b appears in the vector a.

do.call("c",lapply(b,function(x){sum(x == a)}))

This gives me what I want, but I need a vectorized/faster version of this because I am working with >20,000 records. Any help appreciated!

like image 655
AyeTown Avatar asked Jul 17 '19 15:07

AyeTown


2 Answers

You could use outer with colSums:

colSums(outer(a, b, `==`))
[1] 0 0 2 1 1 0 0
like image 191
Andrew Avatar answered Nov 14 '22 04:11

Andrew


You can do:

res <- table(factor(b, levels=b)[match(a, b, nomatch=0)])

salamander       worm        dog     banana        cat     yellow       blue 
         0          0          2          1          1          0          0 

If you want a vanilla vector, there's as.vector(res).


Comments

  • (Thanks to @HectorHaffenden) This approach assumes that all values in b are distinct.
  • I expect this to be faster than making exhaustive comparisons with == as in some other answers. The steps are pretty similar to @GKi's double merge: find where the vectors match, then map back to b.

Benchmarks

Required packages: data.table, purrr, microbenchmark

Various options

library(data.table)
# NelsonGon's answer
purrem <- function() purrr::map_dbl(b, ~sum(.x==a))
# Andrew's answer
vappem <- function() vapply(b, function(x) sum(x == a), FUN.VALUE = integer(1))
# Andrew's answer
collem <- function() colSums(outer(a, b, `==`)) 
# arg0naut91's answer
lappem  <- function() unlist(lapply(b, function(x) sum(x == a)))
# this answer
matchem <- function() table(factor(b, levels=b)[match(a, b, nomatch=0)])
# this answer + data.table
matchem2<- function() 
  setDT(list(b))[, n := 0L][setDT(list(a))[, .N, by=V1], on=.(V1), n := N]$n
# @GKi's answer
mergem <- function() merge(b, table(merge(a, b, by=1)), by=1, all.x=T)[,2]

Example input and benchmarking code

nv = 1e4 # values that can appear in a
nb = 1e3 # values to look up, nb <= na
na = 1e5 # length of a

set.seed(1)
a <- sample(nv, na, replace=TRUE) 
b <- seq_len(nb)

microbenchmark::microbenchmark(times = 10,
pur_res <- purrem(),
vap_res <- vappem(),
col_res <- collem(),
lap_res <- lappem(),
mat_res <- matchem(),
mat_res2<- matchem2(),
mer_res <- mergem()
)

# make sure results match
# left as an exercise for the cautious user
identical(as.vector(mat_res), lap_res) # ok
identical(as.integer(col_res), lap_res) # ok
# etc

Results

Unit: milliseconds
                   expr         min          lq        mean      median          uq        max neval
    pur_res <- purrem()  373.488498  389.331825  479.039835  430.363183  500.948370  858.77997    10
    vap_res <- vappem()  367.247322  397.516902  472.635368  505.782597  532.951841  570.68548    10
    col_res <- collem() 1353.356494 1481.029982 1507.536324 1515.966781 1552.886597 1650.93967    10
    lap_res <- lappem()  352.197701  394.562073  469.988534  507.935397  525.426475  559.56388    10
   mat_res <- matchem()    3.032507    3.230309    5.101941    3.371101    3.874484   15.31595    10
 mat_res2 <- matchem2()    7.591947   11.666453   12.809046   12.266796   13.676658   22.04095    10
    mer_res <- mergem()   23.448314   23.712974   27.730525   24.547323   24.716967   46.92548    10

If it takes under a second, fits in memory and is run once, choosing among these options probably isn't too important. The ranking among the not-slow options probably depends on the parameters of the OP's actual problem (which nv, na, nb can hopefully be adjusted to approximate here).

Feel free to edit in more options and rerun, copying your results over mine here. For example, I couldn't get @NelsonGon's stringi approach to work with these parameters, but maybe someone else has more patience or a more powerful computer. I'd also be curious to see memory usage, but haven't learned the packages that support measuring it yet.

If there is some nv/na/nb configuration where one answer performs particularly well, editing that answer with a similar benchmark highlighting that case is an option.


Just FYI:

bench::mark(
    pur_res <- purrem(),
    vap_res <- vappem(),
    col_res <- collem(),
    lap_res <- lappem(),
    mat_res <- matchem(),
    mat_res2<- matchem2(),
    mer_res <- mergem(),
    stringi <- sapply(b, function(x) sum(stringi::stri_count(x, regex=a))),
    check=FALSE
)

# A tibble: 8 x 14
  expression                                          min     mean   median      max `itr/sec` mem_alloc  n_gc n_itr total_time result     memory          time   gc          
  <chr>                                          <bch:tm> <bch:tm> <bch:tm> <bch:tm>     <dbl> <bch:byt> <dbl> <int>   <bch:tm> <list>     <list>          <list> <list>      
1 pur_res <- purrem()                            421.14ms 424.65ms 424.65ms 428.15ms   2.35     382.21MB     0     2   849.29ms <dbl [1,0~ <Rprofmem [2,1~ <bch:~ <tibble [2 ~
2 vap_res <- vappem()                            367.88ms 370.61ms 370.61ms 373.34ms   2.70     381.52MB     0     2   741.23ms <int [1,0~ <Rprofmem [1,0~ <bch:~ <tibble [2 ~
3 col_res <- collem()                               1.64s    1.64s    1.64s    1.64s   0.608      1.12GB     2     1      1.64s <dbl [1,0~ <Rprofmem [32 ~ <bch:~ <tibble [1 ~
4 lap_res <- lappem()                            411.25ms 506.67ms 506.67ms  602.1ms   1.97     381.53MB     3     2      1.01s <int [1,0~ <Rprofmem [1,0~ <bch:~ <tibble [2 ~
5 mat_res <- matchem()                             3.11ms   3.48ms   3.44ms   5.79ms 287.          1.4MB     0   144   501.66ms <S3: tabl~ <Rprofmem [90 ~ <bch:~ <tibble [14~
6 mat_res2 <- matchem2()                           5.22ms   6.26ms   5.96ms   27.7ms 160.         4.83MB     1    80   501.18ms <int [1,0~ <Rprofmem [435~ <bch:~ <tibble [80~
7 mer_res <- mergem()                             19.88ms  22.75ms  22.02ms   33.6ms  44.0        6.59MB     1    23    523.3ms <int [1,0~ <Rprofmem [410~ <bch:~ <tibble [23~
8 stringi <- sapply(b, function(x) sum(string~      6.57m    6.57m    6.57m    6.57m   0.00254    1.12GB     1     1      6.57m <int [1,0~ <Rprofmem [2,3~ <bch:~ <tibble [1 ~
like image 21
Frank Avatar answered Nov 14 '22 03:11

Frank