
How can I extract and remove a string, so that similar patterns match once instead of multiple times?

Tags: string, r, stringr

Problem description: I'm currently extracting names from a book series. Many characters will go by nicknames, parts of names, or titles. I have a list of names that I'm using as a pattern on all of the data. The problem is that I'm getting multiple matches for full names and the parts of names. There are a total of 3000 names and variations of names that I'm running through a lot of text. The names are currently extracted in order from longest strings to shortest.

Question:

How can I ensure that, after a pattern is extracted, whatever text it matched is removed from the string?

What I get:

str_extract("Mr Bean and friends", pattern = fixed(c("Mr Bean", "Bean", "Mr")))  
[1] "Mr Bean" "Bean"    "Mr"     

What I want: (I know that I can't achieve this only using str_extract() or one line of code)

str_extract("Mr Bean and friends", pattern = fixed(c("Mr Bean", "Bean", "Mr")))
[1] "Mr Bean" NA        NA
Christopher Peralta asked Oct 16 '22 07:10



2 Answers

One option is to update the string iteratively. Because the output should be a vector with the same length as the pattern vector, first create an output vector to hold the results. Then, for each pattern in turn, extract it from the string and remove the matched text, so that later patterns cannot match it again. (The objects pat, str1, and out are defined in the data section at the end of this answer.)

library(stringr)
for (i in seq_along(pat)) {
  # extract the literal pattern, then remove it so later patterns cannot re-match it
  out[i] <- str_extract(str1, pattern = fixed(pat[i]))
  str1 <- str_remove(str1, fixed(pat[i]))
}
out
#[1] "Mr Bean" NA        NA   
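
As a rough sketch, the loop above could be wrapped in a function so that the caller's string is not overwritten. The name extract_once is hypothetical and not part of stringr, and the sketch assumes the patterns are already sorted from longest to shortest, as described in the question.

```r
library(stringr)

# Hypothetical helper (not part of stringr): extract each literal pattern in
# order and remove the matched text, so later patterns cannot match it again.
# Assumes `patterns` is already sorted from longest to shortest.
extract_once <- function(string, patterns) {
  out <- character(length(patterns))
  for (i in seq_along(patterns)) {
    p <- fixed(patterns[i])
    out[i] <- str_extract(string, p)   # NA when the pattern is not found
    string <- str_remove(string, p)    # only the local copy is modified
  }
  out
}

pat <- c("Mr Bean", "Bean", "Mr")
extract_once("Mr Bean and friends", pat)
#[1] "Mr Bean" NA        NA
```

Because the string is modified only inside the function, the same input can be reused for further calls without resetting it.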

Or use the same method with vapply(), updating the initial string with <<-. Reset str1 first (for example, str1 <- str2), because the loop above has already modified it.

unname(vapply(pat, function(p) {
  out <- str_extract(str1, fixed(p))
  str1 <<- str_remove(str1, fixed(p))   # update str1 outside the function
  out
}, character(1)))
#[1] "Mr Bean" NA        NA       

data

# pattern vector
pat <- c("Mr Bean", "Bean", "Mr")
# initial string
str1 <- "Mr Bean and friends"
# unmodified copy of the initial string
str2 <- str1
# initialize an output vector (pat must already be defined)
out <- character(length(pat))
akrun answered Oct 31 '22 17:10



Would using pmatch work?

my_string <- "Mr Bean and friends"
my_pattern <- c("Mr Bean", "Bean", "Mr")

out <- my_pattern[pmatch(my_pattern, my_string)]
out
[1] "Mr Bean" NA        NA
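
One caveat worth checking before relying on this: pmatch() does partial matching on the start of the table element, so it finds a name only when the string begins with it. The small base R check below is a sketch of that behaviour; the second test string is made up for illustration.

```r
pat <- c("Mr Bean", "Bean", "Mr")

# The string starts with a name: "Mr Bean" matches as a prefix, and the
# single table element is then excluded from matching "Mr".
hit_start <- pmatch(pat, "Mr Bean and friends")
print(hit_start)
#[1]  1 NA NA

# The name appears mid-string (made-up example): no pattern is a prefix,
# so nothing matches.
hit_mid <- pmatch(pat, "Hello Mr Bean")
print(hit_mid)
#[1] NA NA NA
```

For names that can occur anywhere in the text, an approach that searches the whole string is likely a better fit.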
twb10 answered Oct 31 '22 16:10
