I am looking for the fastest way in R to add an element (of type character) to a vector if it isn't already present. Right now I simply have:
vect=c("a","b","c")
vect=unique(c(vect,"b"))
vect=unique(c(vect,"d"))
etc
but I presume there must be better ways of doing this. Any thoughts? (My vector has about 2 million strings, all web URLs.)
cheers, Tom
The %chin% operator from data.table is specially written to be fast for character vectors. Here is an example:
# Your data, and we would like to add elements from add
# that are not already in vect
vect <- c("a","b","c")
add <- c( "a" , "d" , "e" , "b" )
# Load package
require( data.table )
# The %chin% operator is the same as %in%, but faster and optimised for character vectors
c( vect , add[ ! add %chin% vect ] )
[1] "a" "b" "c" "d" "e"
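One caveat with the filter-and-append idiom above: it only drops elements of add that are already in vect, so duplicates inside add itself would survive. Here is a minimal sketch of a helper that also handles that case. The name add_new is hypothetical, and it uses base R's %in% so that it runs without data.table; if data.table is loaded, %chin% should be a drop-in replacement for character vectors.

```r
# Hypothetical helper: append only the elements of `add`
# that are not already present in `vect`.
add_new <- function(vect, add) {
  add <- unique(add)             # drop duplicates within `add` itself
  c(vect, add[!add %in% vect])   # keep only the elements missing from `vect`
}

vect <- c("a", "b", "c")
res <- add_new(vect, c("a", "d", "e", "b", "d"))
print(res)
# [1] "a" "b" "c" "d" "e"
```

Note that, like the original one-liner, this assumes vect itself contains no duplicates; it does not deduplicate vect.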
Apparently, you want the union of two vectors:
vect <- c("a","b","c")
add <- c( "a" , "d" , "e" , "b" )
union(vect, add)
#[1] "a" "b" "c" "d" "e"
As Simon points out, this is the same as your solution.
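A quick way to convince yourself of this for character vectors is to compare the two results directly. This is only a sanity check on the small example, not a statement about how union is implemented in every R version:

```r
vect <- c("a", "b", "c")
add <- c("a", "d", "e", "b")

# Compare union() with the unique(c(...)) approach from the question
same <- identical(union(vect, add), unique(c(vect, add)))
print(same)
# [1] TRUE
```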
Here are some benchmarks:
library(data.table)
library(microbenchmark)
microbenchmark(union(vect, add),c( vect , add[ ! add %chin% vect ] ),times=10)
# Unit: microseconds
#                            expr    min     lq  median     uq    max neval
#                union(vect, add) 12.628 13.243 13.3980 15.092 65.599    10
#  c(vect, add[!add %chin% vect])  2.773  3.080  3.3885  4.620 51.740    10
vect <- as.character(seq_len(1e6))
microbenchmark(union(vect, add),c( vect , add[ ! add %chin% vect ] ),times=10)
# Unit: milliseconds
#                            expr       min        lq    median        uq      max neval
#                union(vect, add) 176.34441 188.82082 261.09802 339.96974 493.7810    10
#  c(vect, add[!add %chin% vect])  35.37661  37.14743  47.06862  70.46896 203.7034    10