Extract a sample of words around a particular word using stringr in R

I've seen a couple of similar questions posted on SO regarding this topic, but they seem to be worded improperly (example) or in a different language (example).

In my scenario, I consider everything that is surrounded by white space to be a word. Emoticons, numbers, strings of letters that aren't really words, I don't care. I just want to get some context around the string that was found without having to read the entire file to figure out if it's a valid match.

I tried the following, but it takes a while to run on a long text file:

text <- "He served both as Attorney General and Lord Chancellor of England. After his death, he remained extremely influential through his works, especially as philosophical advocate and practitioner of the scientific method during the scientific revolution. Bacon has been called the father of empiricism.[6] His works argued for the possibility of scientific knowledge based only upon inductive and careful observation of events in nature. Most importantly, he argued this could be achieved by use of a skeptical and methodical approach whereby scientists aim to avoid misleading themselves. While his own practical ideas about such a method, the Baconian method, did not have a long lasting influence, the general idea of the importance and possibility of a skeptical methodology makes Bacon the father of scientific method. This marked a new turn in the rhetorical and theoretical framework for science, the practical details of which are still central in debates about science and methodology today. Bacon was knighted in 1603 and created Baron Verulam in 1618[4] and Viscount St. Alban in 1621;[3][b] as he died without heirs, both titles became extinct upon his death. Bacon died of pneumonia in 1626, with one account by John Aubrey stating he contracted the condition while studying the effects of freezing on the preservation of meat."

stringr::str_extract(text, "(.*?\\s){1,10}Verulam(\\s.*?){1,10}")

I'm assuming there is a much, much faster/more efficient way in which to do this, yes?

asked Dec 21 '15 19:12

tblznbits

2 Answers

Try this:

stringr::str_extract(text, "([^\\s]+\\s){3}Verulam(\\s[^\\s]+){3}")
# alternately, if you like " " more than \\s:
# stringr::str_extract(text, "(?:[^ ]+ ){3}Verulam(?: [^ ]+){3}")

#[1] "and created Baron Verulam in 1618[4] and"

Change the number inside the {} to suit your needs.

You can use non-capturing (?:) groups, too, though I'm not sure yet whether that will improve speed.

stringr::str_extract(text, "(?:[^\\s]+\\s){3}Verulam(?:\\s[^\\s]+){3}")
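To avoid editing the number inside the {} by hand, the pattern can be built from a parameter. The following is a minimal sketch, not part of the original answer: extract_context is a hypothetical helper name, and it uses base R (regexpr with perl = TRUE) so it runs without any packages. The same pattern string should also work with stringr::str_extract(text, pattern), but that is an assumption I have not benchmarked.

```r
# Hypothetical helper (not from the answer above): extract `n` words on each
# side of `word`, where a word is any run of non-whitespace characters.
extract_context <- function(text, word, n = 3) {
  # \\Q...\\E makes the search word literal, so "1618[4]" is not read as a regex
  pattern <- sprintf("(?:\\S+\\s){%d}\\Q%s\\E(?:\\s\\S+){%d}", n, word, n)
  # base R counterpart of str_extract(); returns character(0) when nothing matches
  regmatches(text, regexpr(pattern, text, perl = TRUE))
}

sample_text <- "Bacon was knighted in 1603 and created Baron Verulam in 1618[4] and Viscount St. Alban in 1621"
extract_context(sample_text, "Verulam", n = 3)
# [1] "and created Baron Verulam in 1618[4] and"
```

Because the quantifier is exact, this returns nothing when fewer than n words are available on either side of the match. Like the pattern above, it only reports the first occurrence.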

answered Sep 28 '22 12:09

Jota

I'd use unlist(strsplit) and then index the resulting vector. You could make it a function so that the number of words to fetch pre and post is a flexible parameter:

getContext <- function(text, look_for, pre = 3, post = pre) {
  # create vector of words (anything separated by whitespace)
  t_vec <- unlist(strsplit(text, '\\s'))

  # find positions of exact (case-sensitive) matches
  matches <- which(t_vec == look_for)

  # return words before & after if any matches
  if (length(matches) > 0) {
    out <-
      list(before = sapply(matches, function(m) {
             # NA if there are fewer than `pre` words before the match
             if (m - pre < 1) rep(NA_character_, pre) else t_vec[(m - pre):(m - 1)]
           }),
           # indexing past the end of t_vec yields NA automatically
           after = sapply(matches, function(m) t_vec[(m + 1):(m + post)]))

    return(out)
  } else {
    warning('No matches')
  }
}

Works for a single match

getContext(text, 'Verulam')

# $before
#      [,1]     
# [1,] "and"    
# [2,] "created"
# [3,] "Baron"  
# 
# $after
#      [,1]     
# [1,] "in"     
# [2,] "1618[4]"
# [3,] "and"

Also works if there's more than one match

getContext(text, 'he')

# $before
#      [,1]     [,2]           [,3]          [,4]     
# [1,] "After"  "nature."      "in"          "John"   
# [2,] "his"    "Most"         "1621;[3][b]" "Aubrey" 
# [3,] "death," "importantly," "as"          "stating"
# 
# $after
#      [,1]          [,2]     [,3]      [,4]        
# [1,] "remained"    "argued" "died"    "contracted"
# [2,] "extremely"   "this"   "without" "the"       
# [3,] "influential" "could"  "heirs,"  "condition" 

getContext(text, 'fruitloops')
# Warning message:
#   In getContext(text, "fruitloops") : No matches
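If you'd rather get one readable string per match than separate before/after matrices, the same split-and-index idea can be adapted. This is a minimal sketch under my own assumptions: context_strings is a hypothetical name, and it clamps the window at the start and end of the text instead of returning NA.

```r
# Hypothetical variant of the split-and-index approach: one context string
# per exact match of `look_for`.
context_strings <- function(text, look_for, pre = 3, post = pre) {
  # split on runs of whitespace so repeated spaces don't create empty "words"
  words <- unlist(strsplit(text, "\\s+"))
  hits <- which(words == look_for)
  vapply(hits, function(m) {
    # clamp the window to the bounds of the word vector
    window <- max(1, m - pre):min(length(words), m + post)
    paste(words[window], collapse = " ")
  }, character(1))
}

context_strings("a b c he d e f g he", "he", pre = 2)
# [1] "b c he d e" "f g he"
```

When there are no matches this returns an empty character vector rather than raising a warning.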

answered Sep 28 '22 13:09

arvi1000

Tags: regex, r, stringr
