Problem description: I'm currently extracting names from a book series. Many characters will go by nicknames, parts of names, or titles. I have a list of names that I'm using as a pattern on all of the data. The problem is that I'm getting multiple matches for full names and the parts of names. There are a total of 3000 names and variations of names that I'm running through a lot of text. The names are currently extracted in order from longest strings to shortest.
Question:
How can I ensure that, once a pattern has been extracted, the text it matched is removed from the string so that shorter patterns cannot match it again?
What I get:
str_extract("Mr Bean and friends", pattern = fixed(c("Mr Bean", "Bean", "Mr")))
[1] "Mr Bean" "Bean" "Mr"
What I want: (I know that I can't achieve this only using str_extract() or one line of code)
str_extract("Mr Bean and friends", pattern = fixed(c("Mr Bean", "Bean", "Mr")))
[1] "Mr Bean" NA NA
One option is to update the string iteratively. Since we want an output vector of length 'n', equal to the length of the pattern vector, create an output vector to store the values. Then, after extracting each pattern, remove it from the string and carry the updated string into the next step, so that shorter patterns can no longer match the same text:
library(stringr)
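for (i in seq_along(pat)) {
  # extract the current pattern as a literal string
  out[i] <- str_extract(str1, pattern = fixed(pat[i]))
  # remove the matched text, also as a literal, so later patterns cannot reuse it
  str1 <- str_remove(str1, fixed(pat[i]))
}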
out
#[1] "Mr Bean" NA NA
Or the same method with vapply, updating the initial string with <<-:
unname(vapply(pat, function(p) {
  # str2 is the untouched copy of the initial string (the loop consumes str1)
  out <- str_extract(str2, fixed(p))
  str2 <<- str_remove(str2, fixed(p))
  out
}, character(1)))
#[1] "Mr Bean" NA NA
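Data and setup (run before the snippets above):
# pattern vector, ordered from longest to shortest
pat <- c("Mr Bean", "Bean", "Mr")
# initialize an output vector (pat must already exist)
out <- character(length(pat))
# initial string
str1 <- "Mr Bean and friends"
# untouched copy of the initial string, since the loop overwrites str1
str2 <- str1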
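With around 3000 names and a lot of text, it may be more convenient to wrap the same extract-then-remove idea in a function, so that nothing depends on global variables or on <<-. The following is only a minimal sketch: extract_consuming is a hypothetical helper name (it is not part of stringr), and the second test string is made up for illustration.

```r
library(stringr)

# Hypothetical helper (not part of stringr): tries each pattern in order and
# removes the matched text before moving on to the next pattern.
extract_consuming <- function(text, patterns) {
  out <- rep(NA_character_, length(patterns))
  for (i in seq_along(patterns)) {
    # fixed() treats the name as a literal string, not a regular expression
    out[i] <- str_extract(text, fixed(patterns[i]))
    # remove the first occurrence so shorter patterns cannot match it again
    text <- str_remove(text, fixed(patterns[i]))
  }
  out
}

pat <- c("Mr Bean", "Bean", "Mr")
extract_consuming("Mr Bean and friends", pat)
#[1] "Mr Bean" NA NA

# apply to several strings; returns one result vector per string
texts <- c("Mr Bean and friends", "Bean met Mr Smith")
lapply(texts, extract_consuming, patterns = pat)
```

Like the loop, this only consumes the first occurrence of each name. If every occurrence should be consumed, str_remove_all could be swapped in for str_remove.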
Would using pmatch work?
my_string <- "Mr Bean and friends"
my_pattern <- c("Mr Bean", "Bean", "Mr")
out <- my_pattern[pmatch(my_pattern, my_string)]
out
[1] "Mr Bean" NA NA
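One thing worth checking before relying on this: pmatch does partial matching from the start of each entry in its second argument, so it may not behave the same way when the name is not at the beginning of the string. A small check using only base R (the second test string is made up for illustration):

```r
my_pattern <- c("Mr Bean", "Bean", "Mr")

# name at the start of the string: "Mr Bean" is a prefix, so it matches;
# "Mr" is also a prefix, but the string has already been used by "Mr Bean"
pmatch(my_pattern, "Mr Bean and friends")
#[1]  1 NA NA

# name in the middle of the string: no pattern is a prefix, so nothing matches
pmatch(my_pattern, "Hello Mr Bean")
#[1] NA NA NA
```

So this looks suitable only when each string begins with the name. For names that can appear anywhere in running text, a substring-based approach such as str_extract with fixed() seems the safer fit.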