The problem: I have two data frames, and I want to match a column of data frame data1 against a column of data frame data2.
data1 <- data.frame(v1 = c("horse", "duck", "bird"), v2 = c(1,2,3))
data2 <- data.frame(v1 = c("car, horse, mouse", "duck, bird", "bird"))
If a string from data1 is matched in data2, it should be replaced by the corresponding value of v2 from data1. The following loop produces the result I want:
for (i in seq_len(nrow(data1))) data2[, 1] <- gsub(data1[i, 1], data1[i, 2], data2[, 1], fixed = TRUE)
data2
However, is there a vectorized solution that avoids the for loop and performs better on huge datasets?
Thanks in advance!
--Updated:
What happens when the two data frames don't have the same number of rows?
data2 <- data.frame(v1 = c("car, horse, mouse", "duck, bird","bird", "bird"))
When I use this solution:
data2$v1 <- mapply(sub, data1$v1, data1$v2, data2$v1)
Then I get the following warning message:
1: In mapply(sub, data1$v1, data1$v2, data2$v1) :
  longer argument not a multiple of length of shorter
2: In mapply(sub, data1$v1, data1$v2, data2$v1) :
  longer argument not a multiple of length of shorter
However, the mgsub solution works perfectly! Thank you!
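A note on the warning: mapply pairs its arguments element by element (the i-th pattern with the i-th string) and recycles the shorter ones, so it does not try every pattern on every string the way the loop does. As a minimal base-R sketch of the loop's behaviour, assuming fixed (non-regex) patterns, Reduce can fold each pattern/replacement pair over the whole column. replace_all is a hypothetical helper name, not a function from any package. This still iterates over the patterns internally (only each gsub call is vectorized over the strings), so it is not necessarily faster than the loop:

```r
data1 <- data.frame(v1 = c("horse", "duck", "bird"), v2 = c(1, 2, 3),
                    stringsAsFactors = FALSE)
data2 <- data.frame(v1 = c("car, horse, mouse", "duck, bird", "bird", "bird"),
                    stringsAsFactors = FALSE)

# Hypothetical helper: apply every pattern/replacement pair, in order,
# to every element of x.
replace_all <- function(x, patterns, replacements) {
  patterns <- as.character(patterns)
  replacements <- as.character(replacements)
  Reduce(
    function(acc, i) gsub(patterns[i], replacements[i], acc, fixed = TRUE),
    seq_along(patterns),
    x  # initial value: the untouched strings
  )
}

result <- replace_all(data2$v1, data1$v1, data1$v2)
result
# "car, 1, mouse" "2, 3" "3" "3"
```

Because the patterns are applied to all of data2$v1, the number of rows in data1 and data2 no longer has to agree.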
This uses the updated data2, where data1 and data2 have different numbers of rows. Here, we assume that any match between the v1 columns of the two datasets should be replaced by the corresponding value of the v2 column in data1.
library(qdap)
mgsub(as.character(data1$v1), data1$v2, data2$v1)
#[1] "car, 1, mouse" "2, 3" "3" "3"
Note that mgsub has some error handling that deals with situations where a substring is found within a larger string and both are in the 'to be replaced' list. Here's an example with horse and horses:
data1 <- data.frame(v1 = c("horse", "duck", "bird", "horse", "horses"), v2 = 1:5)
data2 <- data.frame(v1 = c("car, horses, mouse", "duck, bird, horse", "bird"))
library(stringi)
stri_replace_all_fixed(data2$v1, data1$v1, data1$v2)
## [1] "car, 1s, mouse" "2, bird, horse" "3" "car, 4s, mouse" "duck, bird, horse"
## Warning message:
## In stri_replace_all_fixed(data2$v1, data1$v1, data1$v2) :
## longer object length is not a multiple of shorter object length
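The five results and the warning come from stri_replace_all_fixed pairing strings and patterns element by element and recycling the shorter vector. The function also takes a vectorize_all argument; with vectorize_all = FALSE, every pattern is applied in turn to every string. A short sketch with the same data follows. It still replaces in the order given, so horse is replaced before horses is ever tried:

```r
library(stringi)

data1 <- data.frame(v1 = c("horse", "duck", "bird", "horse", "horses"),
                    v2 = 1:5, stringsAsFactors = FALSE)
data2 <- data.frame(v1 = c("car, horses, mouse", "duck, bird, horse", "bird"),
                    stringsAsFactors = FALSE)

# Apply all patterns sequentially to each element of data2$v1
# instead of pairing them off one-to-one.
res <- stri_replace_all_fixed(data2$v1, data1$v1, as.character(data1$v2),
                              vectorize_all = FALSE)
res
## [1] "car, 1s, mouse" "2, 3, 1" "3"
```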
library(qdap)
mgsub(as.character(data1$v1), data1$v2, data2$v1)
## [1] "car, 5, mouse" "2, 3, 4" "3"
mgsub makes sure the longer words are replaced first. This makes mgsub slower but safer. Depending on your data type/needs, either solution here may be of use.
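The longest-first idea can also be sketched in base R, assuming fixed patterns and replacements that never contain any of the patterns themselves. This is not qdap's implementation, only an illustration of the ordering. replace_longest_first is a hypothetical name, and the lookup vectors below are a simplified version of the example without the duplicate horse entry:

```r
patterns     <- c("horse", "duck", "bird", "horses")
replacements <- c("1", "2", "3", "4")
x <- c("car, horses, mouse", "duck, bird, horse", "bird")

# Hypothetical helper: replace longer patterns before shorter ones so that
# "horses" is handled before its substring "horse".
replace_longest_first <- function(x, patterns, replacements) {
  ord <- order(nchar(patterns), decreasing = TRUE)
  for (i in ord) {
    x <- gsub(patterns[i], replacements[i], x, fixed = TRUE)
  }
  x
}

out <- replace_longest_first(x, patterns, replacements)
out
## [1] "car, 4, mouse" "2, 3, 1" "3"
```

Sorting by length costs a little extra, but it avoids the "1s" result that plain in-order replacement gives for horses.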