If I have a string:
moon <- "The cow jumped over the moon with a silver plate in its mouth"
Is there a way I can extract the words in the neighborhood of "moon"? The neighborhood could be 2 or 3 words on either side of "moon".
So if my string is
"The cow jumped over the moon with a silver plate in its mouth"
I want my output only to be:
"jumped over the moon with a silver"
I know I can use str_locate if I wanted to extract by characters, but I'm not sure how I could do it by words. Can this be done in R?
Thanks & Regards, Simak
Here's how I'd do it:
keyword <- "moon"
lookaround <- 2
pattern <- paste0("([[:alpha:]]+ ){0,", lookaround, "}", keyword,
"( [[:alpha:]]+){0,", lookaround, "}")
str <- "The cow jumped over the moon with a silver plate in its mouth"
regmatches(str, regexpr(pattern, str))[[1]]
# [1] "over the moon with a"
The idea: match a run of letters followed by a space, repeated between 0 and "lookaround" (here 2) times, then "keyword" (here "moon"), then a space followed by a run of letters, again repeated between 0 and "lookaround" times. The regexpr function returns the start position and length of the first match. regmatches, which takes that result, then extracts the corresponding substring.
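To make the start/length mechanics concrete, here is a minimal sketch that pulls the substring out by hand. The pattern is the same one as above, written out for a lookaround of 2; the names m, start, len and manual are only illustrative.

```r
# Same pattern as above, written out for lookaround = 2
str <- "The cow jumped over the moon with a silver plate in its mouth"
pattern <- "([[:alpha:]]+ ){0,2}moon( [[:alpha:]]+){0,2}"

m <- regexpr(pattern, str)
start <- as.integer(m)           # position of the first matched character
len <- attr(m, "match.length")   # number of characters matched

# Manual equivalent of regmatches(str, m)
manual <- substring(str, start, start + len - 1)
manual
```

If the pattern is not found, regexpr returns -1, so the manual version would need a check for that case; regmatches handles it by dropping the element.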
Note: regexpr can be replaced with gregexpr if you want to find more than one occurrence of the same pattern.
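A small sketch of the gregexpr variant, assuming a made-up sentence (txt) in which the keyword appears twice:

```r
# txt is a made-up example with two occurrences of the keyword
txt <- "The moon rose and the cow jumped over the moon with a silver plate"
keyword <- "moon"
lookaround <- 2
pattern <- paste0("([[:alpha:]]+ ){0,", lookaround, "}", keyword,
                  "( [[:alpha:]]+){0,", lookaround, "}")

# gregexpr returns a list with one element per input string;
# [[1]] picks the matches for the single string in txt
matches <- regmatches(txt, gregexpr(pattern, txt))[[1]]
matches
```

Here this should give "The moon rose and" and "over the moon with a". Matches do not overlap, so when two occurrences sit closer together than the lookaround, the second neighborhood loses whatever words the first one already consumed.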
A benchmark against the strsplit answer (hong), on 1e5 copies of the string:
str <- "The cow jumped over the moon with a silver plate in its mouth"
ll <- rep(str, 1e5)
hong <- function(str) {
  str <- strsplit(str, " ")
  sapply(str, function(y) {
    i <- which(y == "moon")
    paste(y[seq(max(1, (i-2)), min((i+2), length(y)))], collapse= " ")
  })
}
arun <- function(str) {
  keyword <- "moon"
  lookaround <- 2
  pattern <- paste0("([[:alpha:]]+ ){0,", lookaround, "}", keyword,
                    "( [[:alpha:]]+){0,", lookaround, "}")
  regmatches(str, regexpr(pattern, str))
}
require(microbenchmark)
microbenchmark(t1 <- hong(ll), t2 <- arun(ll), times=10)
# Unit: seconds
# expr min lq median uq max neval
# t1 <- hong(ll) 6.172986 6.384981 6.478317 6.654690 7.193329 10
# t2 <- arun(ll) 1.175950 1.192455 1.200674 1.227279 1.326755 10
identical(t1, t2) # [1] TRUE
Use strsplit:
x <- strsplit(str, " ")[[1]]
i <- which(x == "moon")
paste(x[seq(max(1, (i-2)), min((i+2), length(x)))], collapse= " ")
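The snippet above assumes the keyword occurs exactly once. As a rough sketch, the same idea can be wrapped in a function that returns one neighborhood per occurrence and an empty vector when the word is absent. The name words_around and its arguments are hypothetical, not part of any package.

```r
# Hypothetical helper: one neighborhood of `n` words per occurrence of `keyword`
words_around <- function(text, keyword, n = 2) {
  words <- strsplit(text, " ", fixed = TRUE)[[1]]
  hits <- which(words == keyword)
  # vapply over zero hits returns character(0) instead of failing
  vapply(hits, function(i) {
    paste(words[max(1, i - n):min(i + n, length(words))], collapse = " ")
  }, character(1))
}

moon <- "The cow jumped over the moon with a silver plate in its mouth"
words_around(moon, "moon", n = 3)
# [1] "jumped over the moon with a silver"
words_around(moon, "sun")
# character(0)
```

Note that this matches whole space-separated tokens only, so "moon," with trailing punctuation would not count as a hit.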
Here's an approach using the tm package (when all you've got is a hammer...):
moon <- "The cow jumped over the moon with a silver plate in its mouth"
require(tm)
my.corpus <- Corpus(VectorSource(moon))
# Tokenizer for n-grams and passed on to the term-document matrix constructor
library(RWeka)
neighborhood <- 3 # how many words either side of word of interest
neighborhood1 <- 2 + neighborhood * 2
ngramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = neighborhood1, max = neighborhood1))
dtm <- TermDocumentMatrix(my.corpus, control = list(tokenize = ngramTokenizer))
inspect(dtm)
# find ngrams that have the word of interest in them
word <- 'moon'
subset_ngrams <- dtm$dimnames$Terms[grep(word, dtm$dimnames$Terms)]
# keep only ngrams with the word of interest in the middle. This
# removes duplicates and lets us see what's on either side
# of the word of interest
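# `span` was not defined in the original code; neighborhood + 1 is an
# assumption, chosen because it reproduces the output shown below
span <- neighborhood + 1
subset_ngrams <- subset_ngrams[sapply(subset_ngrams, function(i) {
  tmp <- unlist(strsplit(i, split=" "))
  tmp <- tmp[length(tmp) - span]
  tmp} == word)]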
# inspect output
subset_ngrams
# [1] "jumped over the moon with a silver plate"