I've seen a couple of similar questions posted on SO regarding this topic, but they seem to be worded improperly (example) or in a different language (example).
In my scenario, I consider everything that is surrounded by white space to be a word. Emoticons, numbers, strings of letters that aren't really words, I don't care. I just want to get some context around the string that was found without having to read the entire file to figure out if it's a valid match.
I tried using the following, but it takes a while to run if you've got a long text file:
text <- "He served both as Attorney General and Lord Chancellor of England. After his death, he remained extremely influential through his works, especially as philosophical advocate and practitioner of the scientific method during the scientific revolution. Bacon has been called the father of empiricism.[6] His works argued for the possibility of scientific knowledge based only upon inductive and careful observation of events in nature. Most importantly, he argued this could be achieved by use of a skeptical and methodical approach whereby scientists aim to avoid misleading themselves. While his own practical ideas about such a method, the Baconian method, did not have a long lasting influence, the general idea of the importance and possibility of a skeptical methodology makes Bacon the father of scientific method. This marked a new turn in the rhetorical and theoretical framework for science, the practical details of which are still central in debates about science and methodology today. Bacon was knighted in 1603 and created Baron Verulam in 1618[4] and Viscount St. Alban in 1621;[3][b] as he died without heirs, both titles became extinct upon his death. Bacon died of pneumonia in 1626, with one account by John Aubrey stating he contracted the condition while studying the effects of freezing on the preservation of meat."
stringr::str_extract(text, "(.*?\\s){1,10}Verulam(\\s.*?){1,10}")
I'm assuming there is a much, much faster/more efficient way in which to do this, yes?
To extract words from a string vector, we can use the word() function of the stringr package. For example, if x holds a string of 100 words, the first 20 words can be extracted with word(x, start = 1, end = 20, sep = fixed(" ")).
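As a minimal sketch of the word() call described above, assuming stringr is installed (the sample string x is just a fragment of the question's text, picked for illustration):

```r
library(stringr)

# Illustrative input: a fragment of the text from the question
x <- "and created Baron Verulam in 1618[4] and Viscount St. Alban"

# First three space-separated words
first_three <- word(x, start = 1, end = 3, sep = fixed(" "))
print(first_three)
# [1] "and created Baron"

# Negative indices count backwards from the last word
last_two <- word(x, start = -2, end = -1)
print(last_two)
# [1] "St. Alban"
```

Note that word() selects by position, so on its own it only helps once you already know where the word of interest sits.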
The substr() and substring() functions in R can be used either to extract parts of character strings or to replace parts of them. Both work on a character vector, including a data frame column.
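A short base R sketch of both uses, extraction and in-place replacement (the variable names s and s2 are only for illustration):

```r
s <- "Baron Verulam"

# Extract characters 7 through 13
extracted <- substr(s, 7, 13)
print(extracted)
# [1] "Verulam"

# substring() defaults its end position, so it reads to the end of the string
tail_part <- substring(s, 7)
print(tail_part)
# [1] "Verulam"

# The replacement form overwrites characters in place
s2 <- s
substr(s2, 1, 5) <- "BARON"
print(s2)
# [1] "BARON Verulam"
```

Both functions work on character positions rather than words, so they need a known offset to be useful here.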
In Python, a specific word can be extracted from a string with the find() method. If we do not know the exact position of the word, we can first find its position using find() and then extract the word using string slicing.
The stringr package provides a cohesive set of functions designed to make working with strings as easy as possible. If you're not familiar with strings, the best place to start is the chapter on strings in R for Data Science.
Try this:
stringr::str_extract(text, "([^\\s]+\\s){3}Verulam(\\s[^\\s]+){3}")
# alternately, if you like " " more than \\s:
# stringr::str_extract(text, "(?:[^ ]+ ){3}Verulam(?: [^ ]+){3}")
#[1] "and created Baron Verulam in 1618[4] and"
Change the number inside the {} to suit your needs.
You can use non-capturing (?:) groups, too, though I'm not sure yet whether that will improve speed.
stringr::str_extract(text, "(?:[^\\s]+\\s){3}Verulam(?:\\s[^\\s]+){3}")
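One way to check whether the non-capturing groups matter is to time both patterns yourself. The following is only a rough sketch: long_text and filler are made-up stand-ins for a long file, the repetition counts are arbitrary, and the timings will vary by machine, so no result is claimed here.

```r
library(stringr)

# Hypothetical long input: filler words around the phrase of interest
filler    <- paste(rep("lorem ipsum dolor sit amet", 2000), collapse = " ")
long_text <- paste(filler, "and created Baron Verulam in 1618[4] and", filler)

capturing     <- "([^\\s]+\\s){3}Verulam(\\s[^\\s]+){3}"
non_capturing <- "(?:[^\\s]+\\s){3}Verulam(?:\\s[^\\s]+){3}"

# Run each pattern several times so the elapsed time is measurable
t_cap <- system.time(for (i in 1:50) r_cap <- str_extract(long_text, capturing))
t_non <- system.time(for (i in 1:50) r_non <- str_extract(long_text, non_capturing))

# Both patterns should return the same context string
print(identical(r_cap, r_non))

# Elapsed seconds for each variant
print(c(capturing = t_cap[["elapsed"]], non_capturing = t_non[["elapsed"]]))
```

For anything beyond a quick look, a dedicated benchmarking package would give steadier numbers than system.time().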
I'd use unlist(strsplit()) and then index the resulting vector. You could make it a function so that the number of words to fetch pre and post is a flexible parameter:
getContext <- function(text, look_for, pre = 3, post = pre) {
  # create vector of words (anything separated by whitespace)
  t_vec <- unlist(strsplit(text, '\\s'))
  # find position of matches
  matches <- which(t_vec == look_for)
  # return words before & after if any matches
  if (length(matches) > 0) {
    out <-
      list(before = sapply(matches, function(m) {
                      # NA padding when fewer than `pre` words precede the match
                      if (m - pre < 1) rep(NA_character_, pre) else t_vec[(m - pre):(m - 1)]
                    }),
           after = sapply(matches, function(m) t_vec[(m + 1):(m + post)]))
    return(out)
  } else {
    warning('No matches')
  }
}
Works for a single match
getContext(text, 'Verulam')
# $before
# [,1]
# [1,] "and"
# [2,] "created"
# [3,] "Baron"
#
# $after
# [,1]
# [1,] "in"
# [2,] "1618[4]"
# [3,] "and"
Also works if there's more than one match
getContext(text, 'he')
# $before
# [,1] [,2] [,3] [,4]
# [1,] "After" "nature." "in" "John"
# [2,] "his" "Most" "1621;[3][b]" "Aubrey"
# [3,] "death," "importantly," "as" "stating"
#
# $after
# [,1] [,2] [,3] [,4]
# [1,] "remained" "argued" "died" "contracted"
# [2,] "extremely" "this" "without" "the"
# [3,] "influential" "could" "heirs," "condition"
getContext(text, 'fruitloops')
# Warning message:
# In getContext(text, "fruitloops") : No matches