I have a vector of strings, in the following format:
strings <- c("UUDBK", "KUVEB", "YVCYE")
I also have a data frame like this:
replacewith <- c(8, 4, 2)
searchhere <- c("UUDBK, YVCYE, KUYVE, IHVYV, IYVEK", "KUVEB, UGEVB", "KUEBN, IHBEJ, KHUDN")
dataframe <- data.frame(replacewith, searchhere)
I want each element of the strings vector to be replaced with the value in the "replacewith" column of the row whose "searchhere" entry contains it. Currently the way I am doing it is:
final <- sapply(as.character(strings), function(x)
as.numeric(dataframe[grep(x, dataframe$searchhere), 1]))
However, this is far too computationally expensive for a character vector of length 10^9.
What is a better way to do this?
Thanks!
Similar to @AntoniosK's idea, this instead uses hashmap to map the strings to their values. hashmap is implemented with Rcpp internally, so it is very fast:
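library(hashmap)
library(tidyr)
# One row per individual string: split the comma-separated searchhere column
search_replace = separate_rows(dataframe, searchhere)
# Keys are the strings, values are the replacement numbers
search_hash = hashmap(search_replace$searchhere, search_replace$replacewith)
# Vectorised lookup: one value per element of strings, in the same order
search_hash[[strings]]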
Results:
> search_hash
## (character) => (numeric)
## [KHUDN] => [+2.000000]
## [KUEBN] => [+2.000000]
## [UGEVB] => [+4.000000]
## [KUVEB] => [+4.000000]
## [IYVEK] => [+8.000000]
## [IHVYV] => [+8.000000]
## [...] => [...]
> search_hash[[strings]]
[1] 8 4 8
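If adding a package is not an option, the same lookup-table idea can be sketched in base R with strsplit() and match(). This is only an illustration under the assumption that the keys in searchhere are always separated by ", "; the names key_list, keys and values are introduced here and are not part of the original answer, and no timing is claimed for it.

```r
# Sample data from the question.
strings <- c("UUDBK", "KUVEB", "YVCYE")
dataframe <- data.frame(
  replacewith = c(8, 4, 2),
  searchhere = c("UUDBK, YVCYE, KUYVE, IHVYV, IYVEK",
                 "KUVEB, UGEVB",
                 "KUEBN, IHBEJ, KHUDN"),
  stringsAsFactors = FALSE
)

# Split each searchhere entry into its individual keys
# (assumes the separator is exactly ", ").
key_list <- strsplit(dataframe$searchhere, ", ", fixed = TRUE)

# Flatten the keys and repeat each replacement value once per key in its row.
keys <- unlist(key_list)
values <- rep(dataframe$replacewith, lengths(key_list))

# match() returns, for every element of strings, the position of its key
# (NA when a string has no key), so input order and duplicates are preserved.
final <- values[match(strings, keys)]
final
```

Unlike the grep() approach in the question, this compares whole strings rather than searching for substrings, so a string that only partially overlaps a key yields NA instead of a match.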
Benchmarks:
> OP_func = function(){sapply(as.character(strings), function(x)
as.numeric(dataframe[grep(x,dataframe$searchhere), 1]))}
Unit: microseconds
                           expr     min       lq      mean   median      uq      max neval
                      OP_func() 121.191 124.9410 190.36472 129.8760 151.193 3370.047   100
 d[d$searchhere %in% strings, ]  36.714  40.6605  52.85093  43.8185  61.583  147.246   100
         search_hash[[strings]]  14.212  18.1590  25.05212  21.5150  29.608   58.820   100
Also note that @AntoniosK's solution does not work if there are duplicates in strings, while hashmap will return the correct mapping for each element in the correct position.
Example:
> strings_large = sample(search_replace$searchhere, 100, replace = TRUE)
> strings_large
[1] "YVCYE" "KUVEB" "KUYVE" "KHUDN" "KUYVE" "KHUDN" "KUEBN" "UUDBK" "KHUDN" "YVCYE" "IYVEK"
[12] "KUEBN" "KHUDN" "IHBEJ" "YVCYE" "KHUDN" "KUEBN" "UGEVB" "UUDBK" "KUYVE" "KHUDN" "IHBEJ"
[23] "IHVYV" "KUVEB" "IYVEK" "KHUDN" "KHUDN" "KUYVE" "YVCYE" "UUDBK" "KUYVE" "IHVYV" "KUYVE"
[34] "KUEBN" "KUYVE" "UUDBK" "KUYVE" "KUVEB" "KUVEB" "YVCYE" "KUYVE" "KHUDN" "KUVEB" "YVCYE"
[45] "IHBEJ" "YVCYE" "KHUDN" "UUDBK" "KUEBN" "IYVEK" "IHVYV" "UUDBK" "KUYVE" "KUEBN" "YVCYE"
[56] "UGEVB" "YVCYE" "KUYVE" "IHVYV" "KUEBN" "IHVYV" "IHBEJ" "KUVEB" "IHVYV" "KUYVE" "KUEBN"
[67] "IYVEK" "KUVEB" "KUEBN" "UGEVB" "KUEBN" "KUVEB" "IHBEJ" "KUYVE" "YVCYE" "YVCYE" "IHVYV"
[78] "YVCYE" "KHUDN" "KHUDN" "YVCYE" "IYVEK" "KUYVE" "KHUDN" "UGEVB" "YVCYE" "IHVYV" "KUVEB"
[89] "IYVEK" "KUEBN" "UGEVB" "UUDBK" "IYVEK" "IHBEJ" "IHBEJ" "UUDBK" "KUVEB" "UGEVB" "IYVEK"
[100] "IYVEK"
> search_hash[[strings_large]]
[1] 8 4 8 2 8 2 2 8 2 8 8 2 2 2 8 2 2 4 8 8 2 2 8 4 8 2 2 8 8 8 8 8 8 2 8 8 8 4 4 8 8 2 4 8
[45] 2 8 2 8 2 8 8 8 8 2 8 4 8 8 8 2 8 2 4 8 8 2 8 4 2 4 2 4 2 8 8 8 8 8 2 2 8 8 8 2 4 8 8 4
[89] 8 2 4 8 8 2 2 8 4 4 8 8
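To see why duplicates matter, here is a small base R sketch. It assumes the other approach is the %in% subset that appears in the benchmark table, and it uses a hypothetical four-key lookup (keys, values, strings_dup) rather than the real data. Filtering with %in% returns one value per matching key in table order, whereas a positional lookup such as match() (or the hashmap lookup above) returns one value per input element in input order.

```r
# Hypothetical lookup table: keys[i] maps to values[i].
keys <- c("UUDBK", "YVCYE", "KUVEB", "KHUDN")
values <- c(8, 8, 4, 2)

# Input with repeated elements.
strings_dup <- c("KHUDN", "UUDBK", "KHUDN", "KUVEB", "KHUDN")

# %in% filtering keeps each matching key once, in the order of the table,
# so the result is shorter than the input and not aligned with it.
by_filter <- values[keys %in% strings_dup]
by_filter
# [1] 8 4 2

# Positional lookup gives one value per input element, in input order.
by_match <- values[match(strings_dup, keys)]
by_match
# [1] 2 8 2 4 2
```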