I don't know how to write a regular expression in R that finds words containing one string while excluding another.
For example, the data includes:
x = c("hail", "small hail", "wind hail", "deep hail", "thunderstorm hail", "tstm wind hail", "gusty wind hail", "late season hail", "non severe hail", "marine hail")
I want to find all observations that contain "hail" but do not contain "marine".
My attempt:
x[grep("[^(marine)] hail", x)]
This returns only 5:
"small hail" "wind hail" "deep hail" "tstm wind hail" "gusty wind hail"
I don't know what happens to the other 4.
To exclude specific characters, use a negated character class: square brackets with a leading ^ (caret). For example, the pattern [^abc] matches any single character except a, b, or c.
Definition and Usage. The (?!n) negative lookahead matches a position that is not followed by the string n. It is a zero-width assertion, not a quantifier. Tip: Use the (?=n) positive lookahead to match a position that IS followed by the string n.
A negated character class only excludes individual characters, not whole words: for example, [^abcde] matches any single character other than a, b, c, d, or e.
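The behaviour described above can be checked with a small sketch. The vector `words` is a made-up example, not data from the question; the point is that a negated class tests one character at a time and treats parentheses inside the brackets as literal characters.

```r
# Hypothetical sample values, chosen only to illustrate the rule.
words <- c("a", "b", "d", "abc", "abd")

# Keep elements that contain at least one character other than a, b, or c.
not_abc <- words[grepl("[^abc]", words)]
print(not_abc)  # "d" "abd"

# Inside brackets, ( and ) are ordinary characters, not a group.
paren_test <- grepl("[^(ab)]", c("(", "a", "x"))
print(paren_test)  # FALSE FALSE TRUE
```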
Use lookaround assertions.
> x = c("hail", "small hail", "wind hail", "deep hail", "thunderstorm hail", "tstm wind hail", "gusty wind hail", "late season hail", "non severe hail", "marine hail")
> x[grep("^(?=.*hail)(?!.*marine)", x, perl=TRUE)]
[1] "hail" "small hail" "wind hail"
[4] "deep hail" "thunderstorm hail" "tstm wind hail"
[7] "gusty wind hail" "late season hail" "non severe hail"
OR
Add \b word boundaries if you only want to match whole words. \b matches between a word character and a non-word character (or at the start or end of the string).
> x[grep("^(?=.*\\bhail\\b)(?!.*\\bmarine\\b)", x, perl=TRUE)]
^ Asserts that we are at the start of the string.
(?=.*hail) Positive lookahead which asserts that the string contains hail somewhere after the current position.
(?!.*marine) Negative lookahead which asserts that the string does not contain marine anywhere after the current position.
So the regex produces a zero-length match at the start of the string only if both conditions are satisfied, which is enough for grep to select that element.
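To see when the \b variant gives a different answer, here is a minimal sketch. The vector `y` is hypothetical (it is not in the question's data) and was chosen so that "marine" and "hail" also appear inside longer words.

```r
# Hypothetical values: "submarine" contains "marine", "hailstorm" contains "hail".
y <- c("marine hail", "submarine hail", "hailstorm", "large hail")

# Without word boundaries: plain substring tests.
loose <- y[grepl("^(?=.*hail)(?!.*marine)", y, perl = TRUE)]
print(loose)   # "hailstorm" "large hail"

# With word boundaries: only the whole words hail and marine count.
strict <- y[grepl("^(?=.*\\bhail\\b)(?!.*\\bmarine\\b)", y, perl = TRUE)]
print(strict)  # "submarine hail" "large hail"
```

Which variant is appropriate depends on whether partial-word hits should count in your data.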
You want to use a lookahead assertion in this circumstance. Your negated character class does not do what you expect. Instead, it matches the following:
[^(marine)] # any single character except: '(', 'm', 'a', 'r', 'i', 'n', 'e', ')'
 hail # followed by the literal text ' hail' (a space, then hail)
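As a rough check of that breakdown, the sketch below lists the wanted elements that the original pattern misses, together with the character sitting just before " hail" in each. The helper names `attempt`, `missed`, `has_prev` and `prev` are introduced here for illustration only.

```r
x <- c("hail", "small hail", "wind hail", "deep hail", "thunderstorm hail",
       "tstm wind hail", "gusty wind hail", "late season hail",
       "non severe hail", "marine hail")

# Which elements the original attempt matches.
attempt <- grepl("[^(marine)] hail", x)

# Wanted elements (hail, no marine) that the attempt fails to return.
missed <- x[grepl("hail", x) & !grepl("marine", x) & !attempt]
print(missed)

# Character immediately before " hail"; NA when nothing precedes it.
prev <- rep(NA_character_, length(missed))
has_prev <- grepl(". hail$", missed)
prev[has_prev] <- sub("^.*(.) hail$", "\\1", missed[has_prev])
print(prev)  # NA "m" "n" "e"
```

"hail" has no preceding character and space at all, and the other three end in m, n, or e before the space, which are exactly the letters the class excludes.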
To fix this, you could simply do:
> x[grep('^(?!.*marine).*hail', x, perl=TRUE)]
# [1] "hail" "small hail" "wind hail"
# [4] "deep hail" "thunderstorm hail" "tstm wind hail"
# [7] "gusty wind hail" "late season hail" "non severe hail"
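If lookarounds feel hard to read, the same filter can also be sketched without them by combining two plain substring tests. This is an alternative to the answers above rather than something they propose; `keep`, `result` and `lookahead` are names made up for the example.

```r
x <- c("hail", "small hail", "wind hail", "deep hail", "thunderstorm hail",
       "tstm wind hail", "gusty wind hail", "late season hail",
       "non severe hail", "marine hail")

# TRUE where the element contains "hail" and does not contain "marine".
keep <- grepl("hail", x, fixed = TRUE) & !grepl("marine", x, fixed = TRUE)
result <- x[keep]

# Compare with the lookahead version from the answer.
lookahead <- x[grep("^(?!.*marine).*hail", x, perl = TRUE)]
print(identical(result, lookahead))  # TRUE
```

The single-regex form is handy when only one pattern argument can be passed along; the two-test form keeps each condition separate and avoids the need for perl = TRUE.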