Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Remove part of a string based on overlapping patterns

Tags:

regex

r

stringr

I have the following data:

dat <- data.frame(x               = c("this is my example text", "and here is my other text example", "my other text is short"),
                  some_other_cols = c(1, 2, 2))

Further, I have the following vector of patterns:

my_patterns <- c("my example", "is my", "my other text")

What I want to achieve is to remove any text of my_patterns that occurs in dat$x.

I tried the solution below, but the problem is that as soon as I remove the first pattern from the text (here: "my example"), my solution is not able to detect the occurence of the second (here: "is my") or third pattern anymore.

Wrong solution:

library(tidyverse)
my_patterns_c <- str_c(my_patterns, collapse = "|")

dat_new <- dat %>%
  mutate(short_x = str_replace_all(x, pattern = my_patterns_c, replacement = ""))

I guess I could do sth. like looping through all patterns, collect the string positions in dat$x that match my patterns, then combine them into a range and delete that range from the text. E.g. I add columns to my dat data frame like start_pattern_1 and end_pattern_1 and so on. So for the first row 1 I get 9 (start) and 18 (end) for the first pattern, 6/10 for the second pattern. I then need to check if any end position overlaps with any start position (here start 9 and end 10) and combine them into a range 6-18 and remove this range from the text.

Problem is that I potentially have many new start/end columns then (could be a few hundred patterns in my case) and if I need to pairwise compare the overlapping ranges, my computer will probably crash.

So I'm wondering how I could get it work or how I should best approach this solution. Maybe (and I hope so) there's a better/more elegant/easy solution.

Desired Output of dat would be:

x                                    some_other_cols    short_x
this is my example text              1                  this text
and here is my other text example    2                  and here example
my other text is short               2                  is short

Appreciate your help! Thanks.

like image 664
deschen Avatar asked Jan 28 '20 09:01

deschen


1 Answers

New option with str_locate_all mentionned by Uwe in a comment under the question which greatly simplify the code:

library(stringr)
# Create function to remove matching part of text
# First argument is text, second argument is a list of start and length 
remove_matching_parts <- function(text, positions) {
  if (nrow(positions) == 0) return(text)
  ret <- strsplit(text,"")[[1]]
  lapply(1:nrow(positions), function(x) { ret[ positions[x,1]:positions[x,2] ] <<- NA } )
  paste0(ret[!is.na(ret)],separator="",collapse="")
}

# Loop over the data to apply the pattern
# row = length of vector, columns = length of pattern
matches <- lapply(dat$x, function(x) {
  do.call(rbind,str_locate_all(x, my_patterns)) # transform the list output of str_locate in a table of start/end
})

# Avoid growing a vector in a for loop, create it beforehand, it will be the same length as teh vector we work against
dat$result <- vector("character",length(dat$x))
# Loop on each value to remove the matching parts
for (i in 1:length(dat$x)) {
 dat$result[i] <- remove_matching_parts(as.character(dat$x[i]),matches[[i]])
}

If you have control over the pattern definition and can create it by hand then it can be achieved with a regex solution:

> gsub("(is )?my (other text|example)?","",dat$x)
[1] "this  text"        "and here  example" " is short" 

The idea is to create the pattern with optional parts (the ? after the grouping parentheses.

So we have roughly:

  • (is )? <= optional "is" followed by space
  • my <= literal "my" followed by space
  • (other text|example)? <= Optional text after "my ", either "other text" or (the |) "example"

If you don't have control, things gets messy, I hope I've commented enough for it to be understandable, according to the number of loops included don't expect it to be quick:

# Given datas
dat <- data.frame(x               = c("this is my example text", "and here is my other text example", "my other text is short","yet another text"),
                some_other_cols = c(1, 2, 2, 4))

my_patterns <- c("my example", "is my", "my other text")

# Create function to remove matching part of text
# First argument is text, second argument is a list of start and length 
remove_matching_parts <- function(text, positions) {
  ret <- strsplit(text,"")[[1]]
  lapply(positions, function(x) { ifelse(is.na(x),,ret[ x[1]:x[2] ] <<- NA ) } )
  paste0(ret[!is.na(ret)],separator="",collapse="")
}

# Create the matches between a vector and a pattern
# First argument is the pattern to match, second is the vector of charcaters
match_pat_to_vector <- function(pattern,vector) {
  sapply(regexec(pattern,vector), 
         function(x) {
           if(x>-1) { 
             c(start=as.numeric(x), end=as.numeric(x+attr(x,"match.length")) ) # Create a start/end vector from the index and length of the match
           }
         })
}

# Loop over the patterns to create a dataframe of matches
# row = length of vector, columns = length of pattern
matches <- sapply(my_patterns,match_pat_to_vector,vector=dat$x)

# Avoid growing a vector in a for loop, create it beforehand, it will be the same length as teh vector we work against
dat$result <- vector("character",length(dat$x))
# Loop on each value to remove the matching parts
for (i in 1:length(dat$x)) {
 dat$result[i] <- remove_matching_parts(as.character(dat$x[i]),matches[i,])
}

Result after run:

> dat
                                  x some_other_cols           result
1           this is my example text               1        this text
2 and here is my other text example               2 and here example
3            my other text is short               2         is short
4                  yet another text               4 yet another text
like image 112
Tensibai Avatar answered Sep 25 '22 22:09

Tensibai