I don't really know how to correctly find words using a non-matching regular expression in R.
For example, the data includes:
x = c("hail", "small hail", "wind hail", "deep hail", "thunderstorm hail", "tstm wind hail", "gusty wind hail", "late season hail", "non severe hail", "marine hail")
I want to find all observations that contain "hail" but do not contain "marine".
My attempt:
x[grep("[^(marine)] hail", x)]
This returns only 5:
"small hail" "wind hail" "deep hail" "tstm wind hail" "gusty wind hail"
I don't know what happened to the other 4.
A negated character class excludes specific characters using square brackets and the ^ (hat). For example, the pattern [^abc] will match any single character except for the letters a, b, or c.
(?!n) is a negative lookahead, not a quantifier: it matches at a position that is not followed by the string n. Tip: use the positive lookahead (?=n) to match at a position that IS followed by the string n.
You can use negated character classes to exclude certain characters: for example, [^abcde] will match any single character other than a, b, c, d, or e.
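As a rough illustration of both constructs, here is a minimal sketch in R. The test strings are made up for this example and are not part of the question's data. Note that lookaheads only work with perl = TRUE, since R's default regex engine does not support them.

```r
# Hypothetical test strings, chosen only for illustration.
s <- c("abc", "abd", "cab")

# [^abc] matches ONE character that is not a, b, or c, so a string
# matches only if it contains at least one such character.
has_other <- grepl("[^abc]", s)

# q(?!u) matches a "q" that is not immediately followed by "u".
# Lookaheads need the PCRE engine, hence perl = TRUE.
q_words <- c("Iraq", "quit")
q_not_u <- grepl("q(?!u)", q_words, perl = TRUE)

print(has_other)
print(q_not_u)
```

The key point is that a negated class always stands for a single character, whereas a lookahead tests what follows without consuming anything.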
Use lookaround assertions.
> x = c("hail", "small hail", "wind hail", "deep hail", "thunderstorm hail", "tstm wind hail", "gusty wind hail", "late season hail", "non severe hail", "marine hail")
> x[grep("^(?=.*hail)(?!.*marine)", x, perl=TRUE)]
[1] "hail" "small hail" "wind hail"
[4] "deep hail" "thunderstorm hail" "tstm wind hail"
[7] "gusty wind hail" "late season hail" "non severe hail"
Or add \b boundaries if necessary. \b matches between a word character and a non-word character.
> x[grep("^(?=.*\\bhail\\b)(?!.*\\bmarine\\b)", x, perl=TRUE)]
^ asserts that we are at the start of the string.
(?=.*hail) is a positive lookahead which asserts that the string hail occurs somewhere ahead of the current position.
(?!.*marine) is a negative lookahead which asserts that the string marine does not occur anywhere ahead of the current position.
So the above regex matches at the start of the string (a zero-width match at the anchor) only if both conditions are satisfied.
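To see when the \b version actually behaves differently, here is a small sketch. The vector y and its entries ("submarine hail", "hailstorm") are hypothetical and were added only to show the effect; they do not come from the question's data.

```r
# Hypothetical entries added for illustration; not in the original data.
y <- c("marine hail", "submarine hail", "hailstorm", "small hail")

# Without boundaries: any occurrence of the letters counts,
# so "submarine" is treated as containing "marine".
loose <- y[grep("^(?=.*hail)(?!.*marine)", y, perl = TRUE)]

# With \b: "hail" and "marine" must appear as whole words.
strict <- y[grep("^(?=.*\\bhail\\b)(?!.*\\bmarine\\b)", y, perl = TRUE)]

print(loose)
print(strict)
```

On the question's own data both patterns give the same result, so the boundaries only matter if such partial-word cases can occur.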
You want to use a lookahead assertion in this circumstance. Your negated character class does not do what you expect; instead, the pattern matches the following:
[^(marine)]   # any single character except: '(', 'm', 'a', 'r', 'i', 'n', 'e', ')'
 hail         # ' hail' (a literal space followed by hail)
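The following sketch applies that breakdown to the data from the question to show which elements the original pattern drops. The helper vector before is introduced here purely for illustration.

```r
x <- c("hail", "small hail", "wind hail", "deep hail", "thunderstorm hail",
       "tstm wind hail", "gusty wind hail", "late season hail",
       "non severe hail", "marine hail")

# The original attempt: one character outside the set "(marine)",
# followed by a space and "hail".
matched <- grepl("[^(marine)] hail", x)

# The character immediately before " hail" in each element
# ("" for the bare "hail", which has nothing in front of it).
before <- ifelse(grepl(" hail$", x), sub("^.*(.) hail$", "\\1", x), "")

print(data.frame(x, before, matched))
```

The elements ending in m, n, or e before " hail" ("thunderstorm hail", "late season hail", "non severe hail") fail because that letter is in the excluded set, and the bare "hail" fails because the pattern requires a preceding character and a space. "marine hail" is dropped only because it happens to end in e.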
To fix this, you could simply do:
> x[grep('^(?!.*marine).*hail', x, perl=TRUE)]
# [1] "hail" "small hail" "wind hail"
# [4] "deep hail" "thunderstorm hail" "tstm wind hail"
# [7] "gusty wind hail" "late season hail" "non severe hail"
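If lookaheads feel hard to read, one possible alternative (not part of the answers above, just a sketch) is to combine two plain grepl() calls. It assumes that a simple substring test is enough, so fixed = TRUE is used to skip regex interpretation entirely.

```r
x <- c("hail", "small hail", "wind hail", "deep hail", "thunderstorm hail",
       "tstm wind hail", "gusty wind hail", "late season hail",
       "non severe hail", "marine hail")

# Keep elements that contain "hail" AND do not contain "marine".
keep <- grepl("hail", x, fixed = TRUE) & !grepl("marine", x, fixed = TRUE)
result <- x[keep]

print(result)
```

This gives the same nine elements on this data and avoids the need for perl = TRUE, at the cost of scanning the vector twice.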