I have some code like this (I got it here):
m<- c("Hello! #London is gr8. I really likewhatishappening here! The alcomb of Mount Everest is excellent! the aforementioned place is amazing! #Wow")
x<- gsub("\\<[a-z]\\{4,10\\}\\>","",m)
x
I also tried other approaches, such as:
m<- c("Hello! #London is gr8. I really likewhatishappening here! The alcomb of Mount Everest is excellent! the aforementioned place is amazing! #Wow")
x<- gsub("[^(\\b.{4,10}\\b)]","",m)
x
I need to remove words that are shorter than 4 or longer than 10 characters. Where am I going wrong?
To check the length of a string, a simple approach is to test against a regular expression that starts at the very beginning with a ^ and includes every character until the end by finishing with a $.
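As a minimal sketch of that idea, the anchored pattern below tests each word's length with grepl(). The sample vector `words` and the 4 to 10 range are illustrative assumptions borrowed from the question, not part of the quoted text.

```r
# Hypothetical sample words, borrowed from the question's sentence
words <- c("is", "gr8", "Hello", "excellent", "aforementioned")

# ^ anchors the start and $ anchors the end, so .{4,10} must span the whole string
ok <- grepl("^.{4,10}$", words)

# Keep only the words whose length is between 4 and 10 characters
words[ok]
```

With these sample words, only "Hello" and "excellent" should be kept. For a pure length check, nchar() is a more direct alternative to a regular expression.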
A 'regular expression' is a pattern that describes a set of strings. Two types of regular expressions are used in R: extended regular expressions (the default) and Perl-like regular expressions, enabled with perl = TRUE.
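A small sketch of the two flavours, using made-up test strings: the first call relies on the default extended syntax, while the second uses a lookahead, which is a Perl-like feature and therefore needs perl = TRUE.

```r
# Default (extended) regular expressions: a plain interval quantifier
is_short_word <- grepl("^[a-z]{4,10}$", "hello")

# Perl-like regular expressions: the (?=...) lookahead requires perl = TRUE.
# This pattern accepts 4 to 10 lowercase letters or digits with at least one digit.
has_digit <- grepl("^(?=.*[0-9])[a-z0-9]{4,10}$", c("gr8t", "hello"), perl = TRUE)

is_short_word
has_digit
```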
grepl() returns a logical vector indicating which elements of a character vector contain a match. For example, suppose we want to know which states in the United States begin with the word “New”. grepl() returns a logical vector that can be used to subset the original state.name vector.
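The state example described above can be sketched as follows. state.name is a character vector that ships with base R; the variable names are only illustrative.

```r
# state.name is a built-in character vector of the 50 US state names
starts_with_new <- grepl("^New", state.name)

# Use the logical vector to subset the original vector
new_states <- state.name[starts_with_new]
new_states
```

This should return New Hampshire, New Jersey, New Mexico and New York.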
gsub("\\b[a-zA-Z0-9]{4,10}\\b", "", m)
"! # is gr8. I likewhatishappening ! The of is ! the aforementioned is ! #Wow"
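Let's explain the regular expression terms:
\\b matches a word boundary (the backslash is doubled because it sits inside an R string).
[a-zA-Z0-9] matches a single letter or digit.
{4,10} repeats the preceding character class between 4 and 10 times, so the whole pattern matches words of 4 to 10 characters.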
If you want the opposite, that is, to remove the words shorter than 4 or longer than 10 characters and keep the rest, match those two length ranges instead:
gsub("\\b([a-zA-Z0-9]{1,3}|[a-zA-Z0-9]{11,})\\b", "", m, perl = TRUE)
"Hello! #London  .  really  here!  alcomb  Mount Everest  excellent!   place  amazing! #"
Note that both bounds are inclusive: words with exactly 4 letters, such as "here", are removed by the first regex and kept by this one.
# starting string
m <- c("Hello! #London is gr8. I really likewhatishappening here! The alcomb of Mount Everest is excellent! the aforementioned place is amazing! #Wow")
# remove punctuation (optional)
v <- gsub("[[:punct:]]", " ", m)
# split into distinct words
w <- strsplit( v , " " )
# calculate the length of each word
x <- nchar( w[[1]] )
# keep only words with length 4, 5, 6, 7, 8, 9, or 10
y <- w[[1]][ x %in% 4:10 ]
# string 'em back together
z <- paste( unlist( y ), collapse = " " )
# voila
z
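The same steps can be wrapped in a small function. This is only a sketch: keep_words_by_length() and its min_len and max_len arguments are hypothetical names, not part of base R, and it splits on runs of whitespace rather than single spaces.

```r
# keep_words_by_length() is a hypothetical helper, not part of base R
keep_words_by_length <- function(text, min_len = 4, max_len = 10) {
  # replace punctuation with spaces, then split on runs of whitespace
  words <- strsplit(gsub("[[:punct:]]", " ", text), "[[:space:]]+")[[1]]
  # keep words whose length falls in the inclusive range
  kept <- words[nchar(words) >= min_len & nchar(words) <= max_len]
  # join the surviving words back into one string
  paste(kept, collapse = " ")
}

m <- "Hello! #London is gr8. I really likewhatishappening here! The alcomb of Mount Everest is excellent! the aforementioned place is amazing! #Wow"
z <- keep_words_by_length(m)
z
```

For the sample string this should give "Hello London really here alcomb Mount Everest excellent place amazing".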