Consider two vectors.
a <- c(123, 234, 432, 223)
b <- c(234, 238, 342, 325, 326)
Now, I want to match last two digits of a
to first two digits of b
and create a new vector pasting first digit of a
, the matched part and last digit of b
. My expected output is :
[1] 1234 1238 2342 4325 4326 2234 2238
For simplicity purpose consider all the elements would always be of length 3.
I have tried :
sub_a <- substr(a, 2, 3) #get last two digits of a
sub_b <- substr(b, 1, 2) #get first two digits of b
common <- intersect(sub_a, sub_b)
common
gives me the common elements in both a
and b
which are :
[1] "23" "34" "32"
and then I use match
and paste0
together and I get incomplete output.
paste0(a[match(common, sub_a)], substr(b[match(common, sub_b)], 3, 3))
#[1] "1234" "2342" "4325"
as match
matches only with the first occurrences.
How can I achieve my expected output?
A possible solution:
a <- setNames(a, substr(a, 2, 3))
b <- setNames(b, substr(b, 1, 2))
df <- merge(stack(a), stack(b), by = 'ind')
paste0(substr(df$values.x, 1, 1), df$values.y)
which gives:
[1] "1234" "1238" "2234" "2238" "4325" "4326" "2342"
A second alternative:
a <- setNames(a, substr(a, 2, 3))
b <- setNames(b, substr(b, 1, 2))
l <- lapply(names(a), function(x) b[x == names(b)])
paste0(substr(rep(a, lengths(l)), 1, 1), unlist(l))
which gives the same result and is considerably faster (see the benchmark).
Probably a little complex but works:
unlist( sapply( a, function(x) {
regex <- paste0( substr(x, 2, 3), '(\\d)')
z <- sub(regex, paste0(x, "\\1"), b)
z[!b %in% z]
} ))
which give: [1] "1234" "1238" "2342" "4325" "4326" "2234" "2238"
The main idea is to create a regex for each entry in a, apply this regex to b and replace the values with the current a value and append only the last digit captured (the (\\d)
part of the regex, then filter the resulting vector to get back only the modified values.
Out of curiosity, I did a small benchmark (adding sub_a and sub_b creation into Sotos and Heikki answers so everyone start on the same initial vectors a of 400 observations and b of 500 observations):
Unit: milliseconds
expr min lq mean median uq max neval
Jaap(a, b) 341.0224 342.6853 345.2182 344.3482 347.3161 350.2840 3
Tensi(a, b) 415.9175 416.2672 421.9148 416.6168 424.9134 433.2100 3
Heikki(a, b) 126.9859 139.6727 149.3252 152.3594 160.4948 168.6302 3
Sotos(a, b) 151.1264 164.9869 172.0310 178.8474 182.4833 186.1191 3
MattWBase(a, b) 286.9651 290.8923 293.3795 294.8195 296.5867 298.3538 3
Another way could be to use expand.grid
, so picking up at your sub_a
and sub_b
,
d1 <- expand.grid(a, b, stringsAsFactors = FALSE)
d2 <- expand.grid(sub_a, sub_b, stringsAsFactors = FALSE)
i1 <- d2$Var1 == d2$Var2
d1 <- d1[i1,]
d1$Var1 <- substr(d1$Var1, 1, 1)
do.call(paste0, d1)
#[1] "1234" "2234" "1238" "2238" "2342" "4325" "4326"
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With