Regular expression excluding a word in R

Tags:

regex

r

I don't know how to correctly write a regular expression in R that excludes strings containing a particular word.

E.g., the data is:

x =  c("hail", "small hail", "wind hail",  "deep hail",  "thunderstorm hail", "tstm wind hail", "gusty wind hail", "late season hail", "non severe hail", "marine hail")

I want to find all observations that contain "hail" but not "marine".

My attempt:

x[grep("[^(marine)] hail", x)]

-> I get only 5:

"small hail"      "wind hail"       "deep hail"       "tstm wind hail"  "gusty wind hail"

I don't know what happened to the other 4.

Duy Bui asked Jan 16 '15 14:01


2 Answers

Use lookaround assertions.

> x =  c("hail", "small hail", "wind hail",  "deep hail",  "thunderstorm hail", "tstm wind hail", "gusty wind hail", "late season hail", "non severe hail", "marine hail")
> x[grep("^(?=.*hail)(?!.*marine)", x, perl=TRUE)]
[1] "hail"              "small hail"        "wind hail"        
[4] "deep hail"         "thunderstorm hail" "tstm wind hail"   
[7] "gusty wind hail"   "late season hail"  "non severe hail" 

OR

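Add \b word boundaries if necessary, so that only whole words count. \b matches at a position between a word character and a non-word character, or at the start or end of the string next to a word character.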

> x[grep("^(?=.*\\bhail\\b)(?!.*\\bmarine\\b)", x, perl=TRUE)]

  • ^ asserts that we are at the start of the string.

  • (?=.*hail) is a positive lookahead which asserts that the string contains hail.

  • (?!.*marine) is a negative lookahead which asserts that the string does not contain marine.

  • Both lookaheads are zero-width, so the regex matches at the start of the string only if both conditions are satisfied. perl=TRUE is required because R's default regex engine does not support lookarounds.
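To see what the \b boundaries change, here is a small sketch on a made-up vector y. The strings "submarine hail" and "hailstorm" are illustrative assumptions, not part of the question's data. Without \b, any string that merely contains the letters marine is rejected and any string containing hail is accepted; with \b, only the whole words count.

```r
# Hypothetical test strings, chosen to contrast substring and whole-word matching
y <- c("marine hail", "submarine hail", "hailstorm", "small hail")

loose  <- "^(?=.*hail)(?!.*marine)"               # substring semantics
strict <- "^(?=.*\\bhail\\b)(?!.*\\bmarine\\b)"   # whole-word semantics

loose_hits  <- y[grepl(loose,  y, perl = TRUE)]
strict_hits <- y[grepl(strict, y, perl = TRUE)]

print(loose_hits)   # "hailstorm" "small hail"
print(strict_hits)  # "submarine hail" "small hail"
```

Which variant is appropriate depends on whether a word like "submarine" should count as containing "marine" in your data.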

Avinash Raj answered Oct 04 '22 14:10


You want to use a lookahead assertion here. Your negated character class does not do what you expect: a character class always matches a single character, so the pattern is read as follows:

[^(marine)]  # any character except: '(', 'm', 'a', 'r', 'i', 'n', 'e', ')'
 hail        # ' hail'
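This also accounts for the four missing results. As a quick check (a sketch using the question's data; the names hits, missed and prev are introduced here only for illustration), one can list which strings the original pattern skips and look at the character that sits directly before " hail" in each:

```r
x <- c("hail", "small hail", "wind hail", "deep hail", "thunderstorm hail",
       "tstm wind hail", "gusty wind hail", "late season hail",
       "non severe hail", "marine hail")

# The original attempt: one character not in "(marine)", then " hail"
hits   <- grepl("[^(marine)] hail", x)
missed <- x[!hits]

# Character directly before the trailing " hail" (assumes every string
# ends in "hail", which holds for this data)
prev <- substr(x, nchar(x) - 5, nchar(x) - 5)

print(data.frame(x, prev, hits))
print(missed)
```

"thunderstorm hail", "late season hail" and "non severe hail" are skipped because the character before the space (m, n, e) is in the excluded set, and plain "hail" is skipped because nothing precedes it at all.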

To fix this, you could simply do:

> x[grep('^(?!.*marine).*hail', x, perl=TRUE)]
# [1] "hail"              "small hail"        "wind hail"        
# [4] "deep hail"         "thunderstorm hail" "tstm wind hail"   
# [7] "gusty wind hail"   "late season hail"  "non severe hail"
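If a single regex is not a requirement, the same filter can be sketched without lookaheads by combining two grepl() calls. This is an alternative formulation, not the approach used in the answers; the names has_hail, has_marine and result are illustrative.

```r
x <- c("hail", "small hail", "wind hail", "deep hail", "thunderstorm hail",
       "tstm wind hail", "gusty wind hail", "late season hail",
       "non severe hail", "marine hail")

# fixed = TRUE treats the pattern as a literal string, not a regex
has_hail   <- grepl("hail",   x, fixed = TRUE)
has_marine <- grepl("marine", x, fixed = TRUE)

# Keep strings that contain "hail" and do not contain "marine"
result <- x[has_hail & !has_marine]
print(result)
```

This keeps the nine non-marine entries and may be easier to read and extend than a combined lookahead pattern.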
hwnd answered Oct 04 '22 15:10