Extracting patterns from text in R

Tags: regex, r

My data looks like this:

t <- "The data is like hi hi hi hi  and hi hi end"

and my regular expression is:

grammer <- "[[:space:]]*(hi)+[[:space:]]"

After running the following two lines:

res <- gregexpr(grammer, t)
regmatches(t, res)

I get this output:

 [[1]]
 [1] " hi " "hi "  "hi "  "hi "  " hi " "hi " 

However, I want each run of consecutive matches returned as a single string, like this: " hi hi hi hi " and " hi hi ".

jay_phate asked Oct 15 '14 09:10

1 Answer

You could do it like this:

> t<-"The data is like hi hi hi hi  and hi hi end"
> grammer<-"[[:space:]]*(hi[[:space:]])+[[:space:]]*"
> res<-gregexpr(grammer, t)
> regmatches(t, res)
[[1]]
[1] " hi hi hi hi  " " hi hi "  

Or, if the extra trailing whitespace should not be part of the match:

> grammer<-"[[:space:]]*(hi[[:space:]])+"
> res<-gregexpr(grammer, t)
> regmatches(t, res)
[[1]]
[1] " hi hi hi hi " " hi hi " 

Or, to also match a final hi at the very end of the string:

> t <- "The data is like hi hi hi hi and hi hi end hi"
> grammer<-"[[:space:]]*(hi\\>[[:space:]]?)+"
> res<-gregexpr(grammer, t)
> regmatches(t, res)
[[1]]
[1] " hi hi hi hi " " hi hi "       " hi"

To get the matches without leading or trailing spaces:

> t <- "The data is like hi hi hi hi and hi hi end hi"
> grammer<-"hi\\>([[:space:]]hi)*"
> res<-gregexpr(grammer, t)
> regmatches(t, res)
[[1]]
[1] "hi hi hi hi" "hi hi"       "hi"
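
One caveat: in the pattern above, hi is only checked for a word boundary at the end of the first hi, so it could also match inside longer words (for example the tail of "chi", or the start of "hill" after a space). Below is a minimal sketch of a stricter variant. It assumes Perl-compatible regular expressions via perl = TRUE, and the input x and the names pattern and runs are made up for illustration:

```r
# Hypothetical input: "chi" and "hill" contain "hi" but are not the word "hi"
x <- "chi hi hi hill and hi end hi"

# \\b is a word boundary on both sides of every "hi";
# (?: ... ) groups without capturing (PCRE syntax, hence perl = TRUE)
pattern <- "\\bhi\\b(?:[[:space:]]+hi\\b)*"

# gregexpr() finds all matches; regmatches() extracts them as strings
runs <- regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
print(runs)
```

Here each hi has to be a whole word, so "chi" and "hill" are skipped and only the standalone runs are returned.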

Explanation:

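  • [[:space:]]* matches zero or more whitespace characters.
  • (hi[[:space:]])+ matches hi followed by one whitespace character, one or more times. Because the whitespace is inside the group, each repetition consumes its own separator, so a whole run such as "hi hi hi hi " is a single match. In the original pattern, (hi)+ repeated only the letters hi, so every match stopped at the first space.
  • \\> (the regex \> written as an R string) matches the end of a word, and [[:space:]]? makes the whitespace after hi optional, so a final hi at the end of the string is matched as well.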
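
To see the difference between the two groupings, the original and the corrected pattern can be run side by side on the question's input. This is only a small sketch; the names txt, original, fixed, split_runs and grouped_runs are mine, not from the question:

```r
txt <- "The data is like hi hi hi hi  and hi hi end"

original <- "[[:space:]]*(hi)+[[:space:]]"   # + repeats the letters "hi" only
fixed    <- "[[:space:]]*(hi[[:space:]])+"   # + repeats "hi" plus one whitespace

split_runs   <- regmatches(txt, gregexpr(original, txt))[[1]]
grouped_runs <- regmatches(txt, gregexpr(fixed, txt))[[1]]

print(length(split_runs))   # every "hi" is its own match
print(grouped_runs)         # one match per run of "hi"
```

The first pattern yields the six separate matches shown in the question, while the second yields the two grouped strings.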
Avinash Raj answered Sep 29 '22 11:09
