Match two vectors and replace in string

Tags: replace, r, match

I have two data frames, and I want to match a column (v1) of data1 against a column (v1) of data2.

data1 <- data.frame(v1 = c("horse", "duck", "bird"), v2 = c(1,2,3))
data2 <- data.frame(v1 = c("car, horse, mouse", "duck, bird", "bird"))

If a string from data1$v1 is found in data2$v1, it should be replaced by the corresponding value of v2 from data1. The following for loop produces the desired result:

for (i in seq_len(nrow(data1))) data2[, 1] <- gsub(data1[i, 1], data1[i, 2], data2[, 1], fixed = TRUE)
data2

Is there a vectorized solution that avoids the for loop and performs better on huge datasets?

Thanks in advance!

Update:

What happens when the two data frames do not have the same number of rows?

data2 <- data.frame(v1 = c("car, horse, mouse", "duck, bird","bird", "bird"))

When I use this solution:

data2$v1 <- mapply(sub, data1$v1, data1$v2, data2$v1)

Then I get the following warning message:

Warning messages:
1: In mapply(sub, data1$v1, data1$v2, data2$v1) :
  longer argument not a multiple of length of shorter
2: In mapply(sub, data1$v1, data1$v2, data2$v1) :
  longer argument not a multiple of length of shorter

However, the mgsub solution works perfectly! Thank you!

asked Dec 26 '22 01:12 by OAM


1 Answer

This uses the updated data2, so data1 and data2 have different numbers of rows. The assumption here is that any match of a data1$v1 value within data2$v1 should be replaced by the corresponding v2 value from data1.

library(qdap)
mgsub(as.character(data1$v1), data1$v2, data2$v1)
#[1] "car, 1, mouse" "2, 3"          "3"             "3"    
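If installing qdap is not an option, the same "apply every pattern to every string" idea can be sketched in base R. This is only an illustration, not the method used above: it folds gsub over the rows of data1 with Reduce, so it is still a loop internally and not necessarily faster than the for loop in the question. The name replaced and the stringsAsFactors = FALSE setting are choices made for this sketch.

```r
# Base R sketch (no qdap): apply every pattern in data1 to every string in data2.
data1 <- data.frame(v1 = c("horse", "duck", "bird"), v2 = c(1, 2, 3),
                    stringsAsFactors = FALSE)
data2 <- data.frame(v1 = c("car, horse, mouse", "duck, bird", "bird", "bird"),
                    stringsAsFactors = FALSE)

# Fold over the row indices of data1; each step rewrites the whole column,
# so the number of rows in data1 and data2 does not have to match.
replaced <- Reduce(
  function(txt, i) gsub(data1$v1[i], data1$v2[i], txt, fixed = TRUE),
  seq_len(nrow(data1)),
  data2$v1
)
replaced
#[1] "car, 1, mouse" "2, 3"          "3"             "3"
```

Because each pattern is applied to the full column, no recycling between data1 and data2 takes place, which is why the mapply warning from the question does not appear here.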

Note that mgsub has some handling for situations where a substring is found within a larger string and both are in the 'to be replaced' list. Here's an example with horse and horses, comparing stringi's stri_replace_all_fixed with mgsub:

data1 <- data.frame(v1 = c("horse", "duck", "bird", "horse", "horses"), v2 = 1:5)
data2 <- data.frame(v1 = c("car, horses, mouse", "duck, bird, horse", "bird"))

library(stringi)
stri_replace_all_fixed(data2$v1, data1$v1, data1$v2)

## [1] "car, 1s, mouse"    "2, bird, horse"    "3"                 "car, 4s, mouse"    "duck, bird, horse"
## Warning message:
## In stri_replace_all_fixed(data2$v1, data1$v1, data1$v2) :
##   longer object length is not a multiple of shorter object length
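The five-element result and the warning are consistent with strings and patterns being paired element by element, with the shorter vector recycled, which is my understanding of the function's default (vectorize_all = TRUE) behaviour. The base R sketch below only mimics that pairing to make it visible; the names patterns, replacements, strings, string_index and paired are introduced here for illustration.

```r
# Mimic element-wise pairing with recycling, using base R only.
patterns     <- c("horse", "duck", "bird", "horse", "horses")
replacements <- as.character(1:5)
strings      <- c("car, horses, mouse", "duck, bird, horse", "bird")

# Recycle the 3 strings to the length of the 5 patterns: 1, 2, 3, 1, 2.
string_index <- rep_len(seq_along(strings), length(patterns))

# Pattern i is applied only to string string_index[i], one pair at a time.
paired <- mapply(
  function(p, r, s) gsub(p, r, s, fixed = TRUE),
  patterns, replacements, strings[string_index],
  USE.NAMES = FALSE
)
paired
## [1] "car, 1s, mouse"    "2, bird, horse"    "3"                 "car, 4s, mouse"    "duck, bird, horse"
```

Each string is touched by only one pattern per pair, so "horse" hits the first string before "horses" ever gets a chance, leaving the stray "s".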

library(qdap)
mgsub(as.character(data1$v1), data1$v2, data2$v1)

## [1] "car, 5, mouse" "2, 3, 4"       "3"  

mgsub makes sure that the longer patterns are replaced first, which makes it slower but safer. Depending on your data and needs, either solution here may be of use.
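The "longer first" idea can be sketched in base R as well. This is an assumption-laden illustration rather than qdap's actual implementation: it sorts the lookup rows by pattern length and then applies them in that order. The names lookup and longest_first are made up for the sketch. Which of the two duplicated "horse" rows wins depends on how the sort breaks ties, so the second element may differ from the mgsub output above.

```r
# Base R sketch of longest-pattern-first replacement.
data1 <- data.frame(v1 = c("horse", "duck", "bird", "horse", "horses"),
                    v2 = 1:5, stringsAsFactors = FALSE)
data2 <- data.frame(v1 = c("car, horses, mouse", "duck, bird, horse", "bird"),
                    stringsAsFactors = FALSE)

# Sort the lookup rows so longer patterns come first ("horses" before "horse").
lookup <- data1[order(nchar(data1$v1), decreasing = TRUE), ]

# Apply the patterns in sorted order to the whole column.
longest_first <- Reduce(
  function(txt, i) gsub(lookup$v1[i], lookup$v2[i], txt, fixed = TRUE),
  seq_len(nrow(lookup)),
  data2$v1
)
longest_first
```

Since "horses" is consumed before "horse" is tried, the first string becomes "car, 5, mouse" instead of "car, 1s, mouse".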

answered Jan 10 '23 08:01 by akrun