Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Match substring of two vectors and create a new vector combining them

Tags:

r

vector

Consider two vectors.

a <- c(123, 234, 432, 223)
b <- c(234, 238, 342, 325, 326)

Now, I want to match last two digits of a to first two digits of b and create a new vector pasting first digit of a, the matched part and last digit of b. My expected output is :

[1] 1234 1238 2342 4325 4326 2234 2238

For simplicity purpose consider all the elements would always be of length 3.

I have tried :

sub_a <- substr(a, 2, 3)   #get last two digits of a
sub_b <- substr(b, 1, 2)   #get first two digits of b
common <- intersect(sub_a, sub_b) 

common gives me the common elements in both a and b which are :

[1] "23" "34" "32"

and then I use match and paste0 together and I get incomplete output.

paste0(a[match(common, sub_a)], substr(b[match(common, sub_b)], 3, 3))
#[1] "1234" "2342" "4325"

as match matches only with the first occurrences.

How can I achieve my expected output?

like image 768
Ronak Shah Avatar asked Nov 22 '17 09:11

Ronak Shah


Video Answer


3 Answers

A possible solution:

a <- setNames(a, substr(a, 2, 3))
b <- setNames(b, substr(b, 1, 2))

df <- merge(stack(a), stack(b), by = 'ind')
paste0(substr(df$values.x, 1, 1), df$values.y)

which gives:

[1] "1234" "1238" "2234" "2238" "4325" "4326" "2342"

A second alternative:

a <- setNames(a, substr(a, 2, 3))
b <- setNames(b, substr(b, 1, 2))

l <- lapply(names(a), function(x) b[x == names(b)])
paste0(substr(rep(a, lengths(l)), 1, 1), unlist(l))

which gives the same result and is considerably faster (see the benchmark).

like image 152
Jaap Avatar answered Oct 15 '22 01:10

Jaap


Probably a little complex but works:

unlist( sapply( a, function(x) {
  regex <- paste0( substr(x, 2, 3), '(\\d)')
  z <- sub(regex, paste0(x, "\\1"), b)
  z[!b %in% z] 
} ))

which give: [1] "1234" "1238" "2342" "4325" "4326" "2234" "2238"

The main idea is to create a regex for each entry in a, apply this regex to b and replace the values with the current a value and append only the last digit captured (the (\\d) part of the regex, then filter the resulting vector to get back only the modified values.

Out of curiosity, I did a small benchmark (adding sub_a and sub_b creation into Sotos and Heikki answers so everyone start on the same initial vectors a of 400 observations and b of 500 observations):

Unit: milliseconds
            expr      min       lq     mean   median       uq      max neval
      Jaap(a, b) 341.0224 342.6853 345.2182 344.3482 347.3161 350.2840     3
     Tensi(a, b) 415.9175 416.2672 421.9148 416.6168 424.9134 433.2100     3
    Heikki(a, b) 126.9859 139.6727 149.3252 152.3594 160.4948 168.6302     3
     Sotos(a, b) 151.1264 164.9869 172.0310 178.8474 182.4833 186.1191     3
 MattWBase(a, b) 286.9651 290.8923 293.3795 294.8195 296.5867 298.3538     3
like image 40
Tensibai Avatar answered Oct 15 '22 01:10

Tensibai


Another way could be to use expand.grid, so picking up at your sub_a and sub_b,

d1 <- expand.grid(a, b, stringsAsFactors = FALSE)
d2 <- expand.grid(sub_a, sub_b, stringsAsFactors = FALSE)
i1 <- d2$Var1 == d2$Var2
d1 <- d1[i1,] 
d1$Var1 <- substr(d1$Var1, 1, 1)

do.call(paste0, d1)
#[1] "1234" "2234" "1238" "2238" "2342" "4325" "4326"
like image 41
Sotos Avatar answered Oct 15 '22 00:10

Sotos