
Extracting Words of specific length in R using regular expressions

Tags: string, regex, r

I have the following code (I got it here):

m<- c("Hello! #London is gr8. I really likewhatishappening here! The alcomb of Mount Everest is excellent! the aforementioned place is amazing! #Wow")

x<- gsub("\\<[a-z]\\{4,10\\}\\>","",m)
x

I also tried other ways of doing it, such as:

m<- c("Hello! #London is gr8. I really likewhatishappening here! The alcomb of Mount Everest is excellent! the aforementioned place is amazing! #Wow")

x<- gsub("[^(\\b.{4,10}\\b)]","",m)
x

I need to remove words that are shorter than 4 or longer than 10 characters. Where am I going wrong?

jackStinger asked Dec 10 '12 08:12



2 Answers

  gsub("\\b[a-zA-Z0-9]{4,10}\\b", "", m) 
 "! # is gr8. I  likewhatishappening ! The  of   is ! the aforementioned  is ! #Wow"

The terms of the regular expression are:

  1. \b matches at a position called a "word boundary". This match is zero-length.
  2. [a-zA-Z0-9] matches a single alphanumeric character.
  3. {4,10} is a {min,max} quantifier: the preceding item is repeated between 4 and 10 times.
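
As a quick check of how these three pieces combine, the pattern can be tried on individual words with grepl(). This is only a minimal sketch: the words vector is made up for illustration, and perl = TRUE is an assumption made here so that \b follows PCRE semantics.

```r
# hypothetical sample words, chosen to fall on either side of the 4..10 range
words <- c("is", "gr8", "here", "excellent", "aforementioned")

# \\b anchors both ends at a word boundary, so a word only matches when
# its full length is between 4 and 10 alphanumeric characters
pattern <- "\\b[a-zA-Z0-9]{4,10}\\b"

in_range <- grepl(pattern, words, perl = TRUE)

# label each result with the word it belongs to
setNames(in_range, words)
```

Only "here" and "excellent" come back TRUE. "aforementioned" fails because there is no word boundary inside it where a run of 4 to 10 characters could end.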

If you want the negation of this, wrapping the pattern in () and substituting the backreference (written "\\1" in an R string, not "//1") is not enough: each match is replaced by itself, so the string comes back unchanged.

gsub("(\\b[a-zA-Z0-9]{4,10}\\b)", "\\1", m) 

"Hello! #London is gr8. I really likewhatishappening here! The alcomb of Mount Everest is excellent! the aforementioned place is amazing! #Wow"

Note that {4,10} is inclusive, so 4-letter words such as "here" are matched by both patterns.
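
Another way to keep only the words of 4 to 10 characters is to extract the matches instead of deleting everything else. The following is a minimal sketch using base R's gregexpr() and regmatches(). The names hits and kept are illustrative, and perl = TRUE is an assumption made here for predictable \b handling.

```r
m <- "Hello! #London is gr8. I really likewhatishappening here! The alcomb of Mount Everest is excellent! the aforementioned place is amazing! #Wow"

# locate every run of 4 to 10 alphanumerics that is delimited by word boundaries
hits <- gregexpr("\\b[a-zA-Z0-9]{4,10}\\b", m, perl = TRUE)

# pull out the matched words and join them with single spaces
kept <- paste(regmatches(m, hits)[[1]], collapse = " ")
kept
# "Hello London really here alcomb Mount Everest excellent place amazing"
```

Because only the matches are kept, punctuation and the "#" signs are dropped along with the short and long words.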

agstudy answered Nov 05 '22 23:11


# starting string
m <- c("Hello! #London is gr8. I really likewhatishappening here! The alcomb of Mount Everest is excellent! the aforementioned place is amazing! #Wow")

# remove punctuation (optional)
v <- gsub("[[:punct:]]", " ", m)

# split into distinct words
w <- strsplit( v , " " )

# calculate the length of each word
x <- nchar( w[[1]] )

# keep only words with length 4, 5, 6, 7, 8, 9, or 10
y <- w[[1]][ x %in% 4:10 ]

# string 'em back together (y is already a character vector)
z <- paste( y , collapse = " " )

# voila
z

Anthony Damico answered Nov 05 '22 22:11