I have the following data:
dat <- data.frame(x = c("this is my example text", "and here is my other text example", "my other text is short"),
some_other_cols = c(1, 2, 2))
Further, I have the following vector of patterns:
my_patterns <- c("my example", "is my", "my other text")
What I want to achieve is to remove any text of my_patterns
that occurs in dat$x
.
I tried the solution below, but the problem is that as soon as I remove the first pattern from the text (here: "my example"), my solution is not able to detect the occurence of the second (here: "is my") or third pattern anymore.
Wrong solution:
library(tidyverse)
my_patterns_c <- str_c(my_patterns, collapse = "|")
dat_new <- dat %>%
mutate(short_x = str_replace_all(x, pattern = my_patterns_c, replacement = ""))
I guess I could do sth. like looping through all patterns, collect the string positions in dat$x that match my patterns, then combine them into a range and delete that range from the text. E.g. I add columns to my dat
data frame like start_pattern_1
and end_pattern_1
and so on. So for the first row 1 I get 9 (start) and 18 (end) for the first pattern, 6/10 for the second pattern. I then need to check if any end
position overlaps with any start
position (here start 9 and end 10) and combine them into a range 6-18 and remove this range from the text.
Problem is that I potentially have many new start/end columns then (could be a few hundred patterns in my case) and if I need to pairwise compare the overlapping ranges, my computer will probably crash.
So I'm wondering how I could get it work or how I should best approach this solution. Maybe (and I hope so) there's a better/more elegant/easy solution.
Desired Output of dat
would be:
x some_other_cols short_x
this is my example text 1 this text
and here is my other text example 2 and here example
my other text is short 2 is short
Appreciate your help! Thanks.
New option with str_locate_all mentionned by Uwe in a comment under the question which greatly simplify the code:
library(stringr)
# Create function to remove matching part of text
# First argument is text, second argument is a list of start and length
remove_matching_parts <- function(text, positions) {
if (nrow(positions) == 0) return(text)
ret <- strsplit(text,"")[[1]]
lapply(1:nrow(positions), function(x) { ret[ positions[x,1]:positions[x,2] ] <<- NA } )
paste0(ret[!is.na(ret)],separator="",collapse="")
}
# Loop over the data to apply the pattern
# row = length of vector, columns = length of pattern
matches <- lapply(dat$x, function(x) {
do.call(rbind,str_locate_all(x, my_patterns)) # transform the list output of str_locate in a table of start/end
})
# Avoid growing a vector in a for loop, create it beforehand, it will be the same length as teh vector we work against
dat$result <- vector("character",length(dat$x))
# Loop on each value to remove the matching parts
for (i in 1:length(dat$x)) {
dat$result[i] <- remove_matching_parts(as.character(dat$x[i]),matches[[i]])
}
If you have control over the pattern definition and can create it by hand then it can be achieved with a regex solution:
> gsub("(is )?my (other text|example)?","",dat$x)
[1] "this text" "and here example" " is short"
The idea is to create the pattern with optional parts (the ?
after the grouping parentheses.
So we have roughly:
(is )?
<= optional "is" followed by spacemy
<= literal "my" followed by space(other text|example)?
<= Optional text after "my ", either "other text" or (the |
) "example"If you don't have control, things gets messy, I hope I've commented enough for it to be understandable, according to the number of loops included don't expect it to be quick:
# Given datas
dat <- data.frame(x = c("this is my example text", "and here is my other text example", "my other text is short","yet another text"),
some_other_cols = c(1, 2, 2, 4))
my_patterns <- c("my example", "is my", "my other text")
# Create function to remove matching part of text
# First argument is text, second argument is a list of start and length
remove_matching_parts <- function(text, positions) {
ret <- strsplit(text,"")[[1]]
lapply(positions, function(x) { ifelse(is.na(x),,ret[ x[1]:x[2] ] <<- NA ) } )
paste0(ret[!is.na(ret)],separator="",collapse="")
}
# Create the matches between a vector and a pattern
# First argument is the pattern to match, second is the vector of charcaters
match_pat_to_vector <- function(pattern,vector) {
sapply(regexec(pattern,vector),
function(x) {
if(x>-1) {
c(start=as.numeric(x), end=as.numeric(x+attr(x,"match.length")) ) # Create a start/end vector from the index and length of the match
}
})
}
# Loop over the patterns to create a dataframe of matches
# row = length of vector, columns = length of pattern
matches <- sapply(my_patterns,match_pat_to_vector,vector=dat$x)
# Avoid growing a vector in a for loop, create it beforehand, it will be the same length as teh vector we work against
dat$result <- vector("character",length(dat$x))
# Loop on each value to remove the matching parts
for (i in 1:length(dat$x)) {
dat$result[i] <- remove_matching_parts(as.character(dat$x[i]),matches[i,])
}
Result after run:
> dat
x some_other_cols result
1 this is my example text 1 this text
2 and here is my other text example 2 and here example
3 my other text is short 2 is short
4 yet another text 4 yet another text
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With