Is there a better way to achieve this? I'd like to remove every string from this vector that is a substring of another element.
words = c("please can you",
"please can",
"can you",
"how did you",
"did you",
"have you")
> words
[1] "please can you" "please can" "can you" "how did you" "did you" "have you"
library(data.table)
library(stringr)
dt = setDT(expand.grid(word1 = words, word2 = words, stringsAsFactors = FALSE))
dt[, found := str_detect(word1, fixed(word2))]  # fixed(): match word2 literally, not as a regex
setdiff(words, dt[found == TRUE & word1 != word2, word2])
[1] "please can you" "how did you" "have you"
This works, but it seems like overkill and I'm interested to know a more elegant way of doing it.
Search for each component of words in words, keeping those that occur exactly once, i.e. those that match only themselves:
words[colSums(sapply(words, grepl, words, fixed = TRUE)) == 1]
giving:
[1] "please can you" "how did you" "have you"
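To see why this works, it may help to unpack the one-liner into steps. The sketch below assumes the same words vector as in the question. The helper name drop_substrings is hypothetical, not part of any package. It also calls unique() first, because a duplicated string would otherwise be counted twice and dropped even when nothing else contains it.

```r
words <- c("please can you", "please can", "can you",
           "how did you", "did you", "have you")

# m[i, j] is TRUE when words[j] occurs inside words[i].
# fixed = TRUE makes grepl match literally instead of as a regex.
m <- sapply(words, grepl, words, fixed = TRUE)

# Column j counts the elements that contain words[j]. Every word
# contains itself, so a count of 1 means no other element contains it.
hits <- colSums(m)
result <- words[hits == 1]

# Hypothetical helper (name is an assumption): same idea, but it
# de-duplicates first and counts per word with vapply, so it also
# handles vectors of length 0 or 1.
drop_substrings <- function(x) {
  x <- unique(x)
  counts <- vapply(x, function(w) sum(grepl(w, x, fixed = TRUE)), integer(1))
  x[counts == 1]
}

deduped <- drop_substrings(c(words, "have you"))
```

Note that this compares every pair of elements, so the work grows with the square of the vector length. That is fine for short vectors like this one.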