
Extract all words between two specific words in a character vector

Tags: string, regex, r

Is there a more efficient method? How can I do this without stringr?

txt <- "I want to extract the words between this and that, this goes with that, this is a long way from that"

library(stringr)
w_start <- "this"
w_end <- "that"
pattern <- paste0(w_start, "(.*?)", w_end)
wordsbetween <- unlist(str_extract_all(txt, pattern))
gsub("^\\s+|\\s+$", "", str_sub(wordsbetween, nchar(w_start)+1, -nchar(w_end)-1))
## [1] "and"                "goes with"          "is a long way from"
Ben asked Apr 23 '13 05:04




2 Answers

Using qdap (this is the approach I use there):

library(qdap)
genXtract(txt, "this", "that")

## > genXtract(txt, "this", "that")
##         this  :  that1         this  :  that2         this  :  that3 
##                " and "          " goes with " " is a long way from " 

Without an add-on package:

regmatches(txt, gregexpr("(?<=this).*?(?=that)", txt, perl=TRUE))

## > regmatches(txt, gregexpr("(?<=this).*?(?=that)", txt, perl=TRUE))
## [[1]]
## [1] " and "                " goes with "          " is a long way from "
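
Both results above keep the spaces around each match, while the question's expected output has them trimmed. A minimal sketch of how the base R version could be wrapped up, assuming R 3.2.0 or later for trimws(). The function name extract_between is hypothetical, and the \Q...\E quoting is an addition so that regex metacharacters in the two words are treated literally (it would still break if a word itself contained \E):

```r
txt <- "I want to extract the words between this and that, this goes with that, this is a long way from that"

# Hypothetical helper: returns, for each element of x, the trimmed text
# found between each occurrence of w_start and the next w_end.
extract_between <- function(x, w_start, w_end) {
  # \Q...\E makes PCRE treat the words as literal text, so the
  # lookbehind stays fixed-length as PCRE requires.
  pattern <- paste0("(?<=\\Q", w_start, "\\E).*?(?=\\Q", w_end, "\\E)")
  matches <- regmatches(x, gregexpr(pattern, x, perl = TRUE))
  # Strip leading and trailing whitespace from every match.
  lapply(matches, trimws)
}

extract_between(txt, "this", "that")
## [[1]]
## [1] "and"                "goes with"          "is a long way from"
```

The result is a list with one character vector per element of the input, so unlist() would flatten it for a single string like txt.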
Tyler Rinker answered Oct 11 '22 15:10


Here's another rough attempt using strsplit, though it can probably be refined further:

txtspl <- unlist(strsplit(gsub("[[:punct:]]","",txt),"this|that"))
txtspl[txtspl!=" "][-1]

#[1] " and "                " goes with "          " is a long way from "
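
One possible refinement, as a sketch only: trimming each piece and dropping the empty ones gives the output the question asks for. It carries the same assumptions as the code above: some text precedes the first "this" (which is why the first piece is dropped), and the two words strictly alternate. The variable name parts is made up for this example.

```r
txt <- "I want to extract the words between this and that, this goes with that, this is a long way from that"

# Remove punctuation, then split on either delimiter word.
parts <- strsplit(gsub("[[:punct:]]", "", txt), "this|that")[[1]]

# Trim whitespace so the pieces between "that" and the next "this"
# collapse to empty strings, then drop those empty strings.
parts <- trimws(parts)
parts <- parts[nzchar(parts)]

# The first remaining piece is the text before the first "this".
parts[-1]
## [1] "and"                "goes with"          "is a long way from"
```

Because it splits on either word without checking their order, this would also return text that follows a "that" when no "this" comes before it, so the lookaround version is safer for less regular input.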
thelatemail answered Oct 11 '22 15:10