
Substring extraction from vector in R

Tags: regex, r, stringr

I am trying to extract substrings from unstructured text. For example, assume a vector of country names:

countries <- c("United States", "Israel", "Canada")

How do I pass this vector of character values as a pattern to extract exact matches from unstructured text such as the following?

text.df <- data.frame(ID = c(1:5), 
text = c("United States is a match", "Not a match", "Not a match",
         "Israel is a match", "Canada is a match"))

In this example, the desired output would be:

ID     text
1      United States
4      Israel
5      Canada

So far I have been working with gsub, removing all non-matches and then dropping the rows left with empty values. I have also tried str_extract from the stringr package, but haven't managed to get the arguments for the regular expression right. Any assistance would be greatly appreciated!

asked Nov 04 '25 12:11 by Brian P



1 Answer

1. stringr

We could first subset 'text.df' with 'grep', using 'indx' (the 'countries' vector collapsed into a single alternation pattern with '|') as the pattern. Then 'str_extract' pulls the matching country name out of the 'text' column, and the result is assigned back to the 'text' column of the subset ('text.df1').

library(stringr)
indx <- paste(countries, collapse="|")
text.df1 <- text.df[grep(indx, text.df$text),]
text.df1$text <- str_extract(text.df1$text, indx)
text.df1
#  ID          text
#1  1 United States
#4  4        Israel
#5  5        Canada
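
The collapsed pattern matches anywhere in a string, so "Canada" would also be found inside a longer word. If "exact" means whole words only, one option is to wrap the alternation in word boundaries. The following is a minimal sketch in base R (so it runs without packages); the test strings and the names 'plain' and 'bounded' are made up for illustration, and it assumes the country names contain no regex metacharacters.

```r
countries <- c("United States", "Israel", "Canada")

# Hypothetical test strings: "Canadarm" contains "Canada" but is not the country
txt <- c("Canada is a match", "The Canadarm is not", "Visit Israel")

# Plain alternation, as in the answer above
plain <- paste(countries, collapse = "|")

# Same alternation wrapped in word boundaries (\\b)
bounded <- paste0("\\b(", plain, ")\\b")

plain_hits   <- grepl(plain, txt)                 # matches inside "Canadarm" too
bounded_hits <- grepl(bounded, txt, perl = TRUE)  # whole-word matches only
print(plain_hits)
print(bounded_hits)
```

The 'bounded' string can be used in place of 'indx' in the grep and str_extract calls shown above.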

2. base R

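Without using any external packages, we can extract the matched substrings with 'regmatches' and 'gregexpr', using the same 'indx' pattern and the 'text.df1' subset created above. This assumes each row contains exactly one match; if a row had several, 'unlist' would return more values than there are rows and the assignment would fail.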

text.df1$text <- unlist(regmatches(text.df1$text, 
                           gregexpr(indx, text.df1$text)))
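
As a variation on the base R approach, 'regexpr' (first match only) can do the filtering and the extraction in one pass, without subsetting with 'grep' first. This is a sketch under the assumption that only the first match per row is wanted; the names 'm', 'hit' and 'res' are illustrative and not part of the original answer.

```r
countries <- c("United States", "Israel", "Canada")
text.df <- data.frame(ID = 1:5,
                      text = c("United States is a match", "Not a match",
                               "Not a match", "Israel is a match",
                               "Canada is a match"),
                      stringsAsFactors = FALSE)

indx <- paste(countries, collapse = "|")

# regexpr returns the position of the first match, or -1 where there is none
m <- regexpr(indx, text.df$text)
hit <- which(m != -1)

# regmatches with a regexpr result returns one value per matching element
res <- data.frame(ID = text.df$ID[hit],
                  text = regmatches(text.df$text, m),
                  stringsAsFactors = FALSE)
print(res)
```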

3. stringi

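We could also use the faster stri_extract from stringi. It returns NA for rows without a match, so 'na.omit' drops those rows and '[-2]' removes the original 'text' column, leaving 'ID' and the extracted 'text1'.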

library(stringi)
na.omit(within(text.df, text1<- stri_extract(text, regex=indx)))[-2]
#  ID         text1
#1  1 United States
#4  4        Israel
#5  5        Canada
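
str_extract and stri_extract (in its default mode) return only the first match per string. If a row might mention several countries, one possibility is to keep every match with 'gregexpr' and collapse them per row. The sketch below uses base R with made-up test strings; 'all_hits' and 'collapsed' are illustrative names.

```r
countries <- c("United States", "Israel", "Canada")

# Hypothetical test strings, the first one mentioning two countries
txt <- c("Canada and Israel both match", "Not a match", "United States only")

indx <- paste(countries, collapse = "|")

# A list with one character vector of matches per input string
all_hits <- regmatches(txt, gregexpr(indx, txt))

# Collapse each element to a single string; no match gives ""
collapsed <- vapply(all_hits, paste, character(1), collapse = ", ")
print(collapsed)
```

Rows with an empty string can then be dropped in the same way as the NA rows above.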
answered Nov 07 '25 09:11 by akrun
