I have a series of strings in a dataframe like the ones below:
item_time <- c("pink dress july noon", "shirt early september morning",
               "purple dress april", "tall purple shoes february")
I want to extract everything to the left of the first match from a list of possible time strings like these:
time <- c("january", "january night", "february", "march", "april", "may",
          "may morning", "june", "july", "july noon", "august", "september",
          "early september morning", "october", "november", "december")
The result I want would look like this:
[1] pink dress
[2] shirt
[3] purple dress
[4] tall purple shoes
I can't separate them by spaces, as there is a varying number of words in both the time and item parts. I also don't have a symbol that separates them. I feel there should be a simple and elegant way of solving this, but I can't figure it out.
We can use strsplit in Base R:
sapply(strsplit(item_time, split=paste0("\\s", time, collapse="|")), `[`, 1)
# [1] "pink dress" "shirt" "purple dress" "tall purple shoes"
Notes:
I first collapse the time vector into a single string, separating the terms with |, then use that to split item_time with strsplit. Since the split argument of strsplit accepts regular expressions, | is interpreted as an OR operator, effectively splitting item_time wherever one of the terms in time appears. sapply(..., `[`, 1) then looks at each element of the resulting list and extracts its first element, which is the leftmost string after the split.
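The steps described above can be written out one at a time to make the intermediate objects visible. This is only a sketch of the same approach; the names pattern, pieces and items are introduced here for illustration and are not part of the original answer.

```r
item_time <- c("pink dress july noon", "shirt early september morning",
               "purple dress april", "tall purple shoes february")
time <- c("january", "january night", "february", "march", "april", "may",
          "may morning", "june", "july", "july noon", "august", "september",
          "early september morning", "october", "november", "december")

# Step 1: prefix every term with "\\s" and join the terms with "|",
# giving one regular expression of the form "\\sjanuary|\\sjanuary night|..."
pattern <- paste0("\\s", time, collapse = "|")

# Step 2: split each string wherever the pattern matches;
# the result is a list with one character vector per input string
pieces <- strsplit(item_time, split = pattern)

# Step 3: keep only the first piece of each vector (the text left of the match)
items <- sapply(pieces, `[`, 1)
items
# [1] "pink dress"        "shirt"             "purple dress"      "tall purple shoes"
```

Printing pattern and pieces in an interactive session is a quick way to check what the regular expression actually looks like before trusting the final result.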
You can also use sub, since it is vectorized over its input:
sub(paste0("\\s*",time,".*",collapse="|"),"",item_time)
# [1] "pink dress" "shirt" "purple dress" "tall purple shoes"
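In the sub version each alternative has the form \\s*<term>.*, so the match runs from the whitespace before the term to the end of the string and is replaced with "". Because \\s* may match zero spaces, a short term such as may can also match inside a longer word. The sketch below is one possible tightening, not part of the original answer: it assumes every time term is preceded by at least one whitespace character and ends at a word boundary. The names strict_pattern and cleaned, and the "mayonnaise jar june" input, are made up for illustration.

```r
item_time <- c("pink dress july noon", "shirt early september morning",
               "purple dress april", "tall purple shoes february")
time <- c("january", "january night", "february", "march", "april", "may",
          "may morning", "june", "july", "july noon", "august", "september",
          "early september morning", "october", "november", "december")

# Group the alternatives once, require leading whitespace (\\s+),
# and require a word boundary (\\b) right after the matched term.
strict_pattern <- paste0("\\s+(", paste(time, collapse = "|"), ")\\b.*$")

# Remove everything from the first matching term to the end of the string.
cleaned <- sub(strict_pattern, "", item_time, perl = TRUE)
cleaned
# [1] "pink dress"        "shirt"             "purple dress"      "tall purple shoes"

# Hypothetical input: "may" inside "mayonnaise" is no longer treated as a time.
sub(strict_pattern, "", "mayonnaise jar june", perl = TRUE)
# [1] "mayonnaise jar"
```

Note that, under this assumption, a string consisting only of a time term (with no item in front of it) is left unchanged, because there is no leading whitespace to match.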