I have two very large vectors that I need to concatenate with a delimiter to form unique IDs. For example:
set.seed(1)
vec1 <- sample(1:10, 10000000, replace = TRUE)
vec2 <- sample(1:1000000000, 10000000)
I am currently using paste0():
system.time({
uniq_id <- paste0(vec1, "_", vec2)
})
However, due to the size of vec1 and vec2, this is quite slow. Is there an alternative method with better performance?
Concatenate two strings: bind the strings together, e.g. vec <- cbind(string1, string2), then call paste() on the combined object with the collapse argument so that its elements are joined into one string.
The difference between paste and paste0 is the separator: paste() inserts sep (a single space by default) between its arguments, whereas paste0() inserts nothing and is equivalent to paste(..., sep = ""). print() prints the value of any R object.
The paste() function with the collapse argument: when you pass a single vector to paste(), the sep argument has no effect, because sep only goes between corresponding elements of different arguments. This is where collapse is useful: it is the string placed between the elements of the result when they are joined into a single string.
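The difference between sep and collapse is easiest to see side by side. The following is a minimal base R sketch; the vectors x and y are made up for illustration.

```r
x <- c("a", "b", "c")
y <- c("1", "2", "3")

# sep goes between corresponding elements of different arguments.
pairwise <- paste(x, y, sep = "_")              # "a_1" "b_2" "c_3"

# paste0() behaves like paste() with sep = "".
no_sep <- paste0(x, y)                          # "a1" "b2" "c3"

# With a single vector there is nothing for sep to separate.
unchanged <- paste(x, sep = "_")                # "a" "b" "c"

# collapse joins the elements of the result into one string.
joined <- paste(x, collapse = "_")              # "a_b_c"

# Both together: pair up element-wise first, then collapse.
both <- paste(x, y, sep = "_", collapse = ",")  # "a_1,b_2,c_3"
```

For building one ID per row, as in the question, only sep (or paste0) is relevant; collapse would reduce everything to a single string.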
A more efficient way is stringi::stri_c:
library(microbenchmark)
b <- microbenchmark(
paste = paste0(vec1, "_", vec2),
stringi = stringi::stri_c(vec1, vec2, sep = "_"),
times = 10
)
Result
b
#Unit: seconds
# expr min lq mean median uq max neval cld
# paste 5.475398 5.509957 5.544477 5.542728 5.566904 5.632173 10 b
# stringi 3.862541 3.871826 3.896242 3.897264 3.914894 3.934175 10 a
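Before switching functions, it may be worth confirming that both calls produce the same IDs. The sketch below uses much smaller vectors than the question (an assumption made only to keep the check quick) and skips the comparison if stringi is not installed.

```r
# Small stand-ins for the question's vectors (size reduced for a quick check).
set.seed(1)
vec1 <- sample(1:10, 1000, replace = TRUE)
vec2 <- sample(1:1000000000, 1000)

base_ids <- paste0(vec1, "_", vec2)

# stringi is optional here; the comparison only runs if the package is available.
if (requireNamespace("stringi", quietly = TRUE)) {
  stri_ids <- stringi::stri_c(vec1, vec2, sep = "_")
  # Stops with an error if the two results differ.
  stopifnot(identical(base_ids, stri_ids))
}
```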
Comparing paste, paste0 (R version 4.1.0), stringi::stri_c (version 1.6.2) and stringr::str_c (version 1.4.0), I could not observe much difference in performance, but this may depend on what is being concatenated. It makes a big difference whether the inputs are numbers or characters, and whether the characters consist of digits or letters. When there are only letters, stringi and stringr seem to be faster than paste.
M <- alist(
paste0 = paste0(vec1, "_", vec2)
, paste = paste(vec1, "_", vec2, sep = "")
, pasteS = paste(vec1, vec2, sep = "_")
, stringi = stringi::stri_c(vec1, "_", vec2)
, stringiS = stringi::stri_c(vec1, vec2, sep = "_")
, stringr = stringr::str_c(vec1, "_", vec2)
, stringrS = stringr::str_c(vec1, vec2, sep = "_")
)
set.seed(42)
n <- 1e5
vec1 <- sample(1:10, n, TRUE)
vec2 <- sample(1:1000000000, n, TRUE)
bench::mark(exprs = M)
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time
# <bch:expr> <bch:t> <bch:t> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm>
#1 paste0 62.8ms 63.9ms 15.6 2.29MB 2.23 7 1 447ms
#2 paste 61.9ms 63ms 15.9 2.29MB 0 8 0 503ms
#3 pasteS 57.5ms 58.6ms 17.1 2.29MB 2.13 8 1 468ms
#4 stringi 57.1ms 57.6ms 17.2 2.29MB 0 9 0 524ms
#5 stringiS 56.2ms 66.2ms 14.4 2.29MB 2.40 6 1 417ms
#6 stringr 57.9ms 62.9ms 14.8 2.29MB 0 8 0 541ms
#7 stringrS 55ms 61.4ms 15.3 2.29MB 0 8 0 523ms
vec1 <- as.character(vec1)
vec2 <- as.character(vec2)
bench::mark(exprs = M)
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time
# <bch:expr> <bch:t> <bch:t> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm>
#1 paste0 34.2ms 35.3ms 28.2 781KB 2.17 13 1 460ms
#2 paste 35.1ms 35.7ms 27.9 781KB 0 14 0 502ms
#3 pasteS 32ms 33.5ms 29.9 781KB 2.14 14 1 468ms
#4 stringi 33.7ms 35.6ms 28.1 781KB 0 15 0 534ms
#5 stringiS 32.6ms 33.9ms 29.6 781KB 2.12 14 1 472ms
#6 stringr 34.6ms 34.9ms 28.5 781KB 0 15 0 526ms
#7 stringrS 33.1ms 33.4ms 29.7 781KB 2.12 14 1 471ms
set.seed(42)
n <- 1e5
vec1 <- as.character(sample(0:9, n, TRUE))
vec2 <- as.character(sample(0:9, n, TRUE))
bench::mark(exprs = M)
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time
# <bch:expr> <bch:t> <bch:t> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm>
#1 paste0 18.9ms 19ms 52.4 781KB 2.02 26 1 496ms
#2 paste 18.9ms 19ms 52.5 781KB 0 27 0 514ms
#3 pasteS 15.2ms 15.3ms 65.3 781KB 2.04 32 1 490ms
#4 stringi 15.1ms 15.1ms 65.7 781KB 0 33 0 502ms
#5 stringiS 13.5ms 13.5ms 73.7 781KB 2.05 36 1 489ms
#6 stringr 15.1ms 15.2ms 65.7 781KB 2.05 32 1 487ms
#7 stringrS 13.4ms 13.5ms 73.3 781KB 0 37 0 505ms
set.seed(42)
n <- 1e5
vec1 <- paste(sample(0:9, n, TRUE))
vec2 <- paste(sample(0:9, n, TRUE))
bench::mark(exprs = M)
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time
# <bch:expr> <bch:t> <bch:t> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm>
#1 paste0 18.95ms 19.18ms 52.1 781KB 0 27 0 518ms
#2 paste 18.78ms 18.98ms 52.6 781KB 2.02 26 1 494ms
#3 pasteS 14.29ms 14.49ms 69.0 781KB 0 35 0 508ms
#4 stringi 9.6ms 9.83ms 101. 781KB 2.02 50 1 495ms
#5 stringiS 7.55ms 7.73ms 127. 781KB 2.01 63 1 496ms
#6 stringr 9.58ms 9.75ms 101. 781KB 2.03 50 1 493ms
#7 stringrS 7.54ms 7.77ms 127. 781KB 2.02 63 1 496ms
set.seed(42)
n <- 1e5
vec1 <- sample(letters, n, TRUE)
vec2 <- sample(LETTERS, n, TRUE)
bench::mark(exprs = M)
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time
# <bch:expr> <bch:t> <bch:t> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm>
#1 paste0 15.98ms 16.02ms 61.5 781KB 2.05 30 1 488ms
#2 paste 16.02ms 16.09ms 62.1 781KB 2.07 30 1 483ms
#3 pasteS 11.96ms 12.03ms 83.0 781KB 2.02 41 1 494ms
#4 stringi 7.97ms 8.07ms 123. 781KB 4.18 59 2 478ms
#5 stringiS 6.37ms 6.43ms 154. 781KB 4.12 75 2 486ms
#6 stringr 7.97ms 8.02ms 124. 781KB 2.04 61 1 491ms
#7 stringrS 6.43ms 6.49ms 153. 781KB 4.09 75 2 489ms
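The differences depend on how the character vector is stored internally: either as a regular vector of CHARSXP elements, or as a deferred string conversion that still wraps the original REALSXP or INTSXP data.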
x <- as.character(1:2)
.Internal(inspect(x))
#@55d9df5270d8 16 STRSXP g0c0 [REF(1)] <deferred string conversion>
# @55d9df527180 13 INTSXP g0c0 [REF(65535)] 1 : 2 (compact)
x <- as.character(c(0,2))
.Internal(inspect(x))
#@55d9df5430a0 16 STRSXP g0c0 [REF(1)] <deferred string conversion>
# @55d9df6720a8 14 REALSXP g0c2 [REF(65535)] (len=2, tl=0) 0,2
x <- paste(1:2)
.Internal(inspect(x))
#@55d9df610d08 16 STRSXP g0c2 [REF(1)] (len=2, tl=0)
# @55d9d2e30458 09 CHARSXP g1c1 [MARK,REF(40995),gp=0x61] [ASCII] [cached] "1"
# @55d9d2e58b00 09 CHARSXP g1c1 [MARK,REF(40555),gp=0x60] [ASCII] [cached] "2"
x <- letters[1:2]
.Internal(inspect(x))
#@55d9df672168 16 STRSXP g0c2 [REF(1)] (len=2, tl=0)
# @55d9d2c80518 09 CHARSXP g1c1 [MARK,REF(541),gp=0x61] [ASCII] [cached] "a"
# @55d9d2fb7d58 09 CHARSXP g1c1 [MARK,REF(44),gp=0x61] [ASCII] [cached] "b"
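To see whether the storage form matters on your own machine, you can build the same strings both ways and time them. This is only a rough base R sketch: the variable names are made up, whether as.character() returns a deferred conversion depends on the R version, and single system.time() runs are noisy.

```r
set.seed(42)
n <- 1e5
ints <- sample(0:9, n, TRUE)

deferred     <- as.character(ints)  # may be a deferred string conversion in recent R
materialized <- paste(ints)         # regular character vector of CHARSXP elements

# The contents are the same; only the internal representation may differ.
stopifnot(identical(deferred, materialized))

# Rough single-run timings; the numbers vary by machine and R version.
t_deferred     <- system.time(paste0(deferred, "_", deferred))[["elapsed"]]
t_materialized <- system.time(paste0(materialized, "_", materialized))[["elapsed"]]
print(c(deferred = t_deferred, materialized = t_materialized))
```

For more reliable numbers, the same two expressions can be passed to bench::mark as in the benchmarks above.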