
Exclude elements from vector based on regular expression pattern

Tags:

regex

r

I have some data which I want to clean up using a regular expression in R.

It is easy to find out how to keep elements that contain a certain pattern, or that do not contain certain fixed words (strings), but I can't find out how to exclude elements that match a regular expression pattern.

How could I use a general function to only keep those elements from a vector which do not contain PATTERN?

I prefer not to give an example, as this might lead people to answer using other (though usually nice) ways than the intended one: excluding based on a regular expression. Here goes anyway:

How can I exclude all the elements that contain any of the following characters: 'pyfgcrl

vector <- c("Cecilia", "Cecily", "Cecily's", "Cedric", "Cedric's", "Celebes", 
            "Celebes's", "Celeste", "Celeste's", "Celia", "Celia's", "Celina")

The result would be an empty vector in this case.

PascalVKooten asked Jul 07 '13 11:07



1 Answer

Edit: As the comments pointed out, and as a little testing confirms, my original suggestion wasn't correct.

Here are two correct solutions:

vector[!grepl("['pyfgcrl]", vector)]                    ## kohske
grep("['pyfgcrl]", vector, value = TRUE, invert = TRUE) ## flodel

If either of them wants to re-post and accept credit for their answer, I'm more than happy to delete mine here.


Explanation

The general function that you are looking for is grepl. From the help file for grepl:

grepl returns a logical vector (match or not for each element of x).

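Additionally, you should read the help page for regex, which describes what character classes are. In this case, you create a character class ['pyfgcrl], which matches any single character listed inside the square brackets (the apostrophe included). grepl then returns TRUE for every element containing at least one of those characters, and you can negate that logical result with !.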

So, up to this point, we have something that looks like:

!grepl("['pyfgcrl]", vector)

To get what you are looking for, use that logical vector to subset as usual.

vector[!grepl("['pyfgcrl]", vector)]
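Since the vector from the question yields an empty result, here is a minimal sketch of the same idiom on data where some elements survive. The vector `x` and the variable names `pattern`, `keep`, and `result` are made up for illustration and are not part of the question.

```r
pattern <- "['pyfgcrl]"

# Hypothetical sample data (not from the question)
x <- c("Cecilia", "Anna", "Emma", "Cedric's", "Tom")

# TRUE for elements that contain none of the listed characters
keep <- !grepl(pattern, x)

# Subset with the logical vector
result <- x[keep]
print(result)
# [1] "Anna" "Emma" "Tom"

# The vector from the question: every element matches, so nothing is kept
vector <- c("Cecilia", "Cecily", "Cecily's", "Cedric", "Cedric's", "Celebes",
            "Celebes's", "Celeste", "Celeste's", "Celia", "Celia's", "Celina")
empty <- vector[!grepl(pattern, vector)]
print(length(empty))
# [1] 0
```

Note that the match is case sensitive: "Cecilia" is excluded because of its lowercase c and l, not because of the capital C.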

For the second solution, offered by @flodel: grep by default returns the indices of the elements that match, and the value = TRUE argument makes it return the matching strings themselves instead. invert = TRUE makes it return the elements that did not match.
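To make the effect of each grep argument concrete, here is a small sketch on a made-up vector (`x` and the result names are illustrative only, not from the question). It also checks that both solutions give the same result on this input.

```r
pattern <- "['pyfgcrl]"

# Hypothetical sample data (not from the question)
x <- c("Anna", "Cecilia", "Emma", "Cedric's", "Tom")

# Default: integer indices of the matching elements
idx <- grep(pattern, x)
print(idx)
# [1] 2 4

# value = TRUE: the matching strings themselves
matched <- grep(pattern, x, value = TRUE)
print(matched)
# [1] "Cecilia"  "Cedric's"

# value = TRUE, invert = TRUE: the strings that did NOT match
kept <- grep(pattern, x, value = TRUE, invert = TRUE)
print(kept)
# [1] "Anna" "Emma" "Tom"

# Same result as the grepl-based subsetting on this input
same <- identical(kept, x[!grepl(pattern, x)])
print(same)
# [1] TRUE
```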

A5C1D2H2I1M1N2O1R2T1 answered Oct 21 '22 12:10
