I am trying to use regular expressions with the stringr package to extract some text. For some reason, I'm getting an 'Invalid regexp' error. I have tried the regular expression in some website test tools, and it seems to work there. I was wondering if there is something unique about how regex works in R, and particularly in the stringr package.
Here is an example:
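library(stringr)

string <- c("MARKETING: Vice President", "FINANCE: Accountant I",
            "OPERATIONS: Plant Manager")
pattern <- "[A-Z]+(?=:)"               # capitals followed by, but not including, a colon
test <- gsub(" ", "", string)          # strip the spaces
results <- str_extract(test, pattern)  # this is the call that raises 'Invalid regexp'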
This doesn't seem to be working. I would like to get "MARKETING", "FINANCE", and "OPERATIONS" without the ":" in them. That is why I'm using the lookahead syntax. I realize that I can just work around this using:
pattern <- "[A-Z]+(:)"
test <- gsub(" ","",string)
results <- gsub(":","",str_extract(test, pattern))
But I anticipate that I might need to use lookarounds for more complex situations than this in the near future.
Do I need to amend the regex with some escapes or something to make this work?
Two types of regular expressions are used in R: extended regular expressions (the default) and Perl-like regular expressions, selected with perl = TRUE. There is also fixed = TRUE, which can be considered to use a literal regular expression.
By default R uses POSIX extended regular expressions. In older versions of R, setting extended to FALSE selected basic POSIX regular expressions instead, but that argument is no longer available in current releases. If perl is set to TRUE, R will use the Perl 5 flavor of regular expressions as implemented in the PCRE library.
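To make the difference between the two engines concrete, here is a minimal base R sketch that tries the lookahead pattern from the question with and without perl = TRUE. The helper names default_ok and perl_ok are made up for illustration, and the exact error text may differ between R versions, so the code only records whether each call succeeds.

```r
string <- c("MARKETING: Vice President", "FINANCE: Accountant I",
            "OPERATIONS: Plant Manager")
pattern <- "[A-Z]+(?=:)"

# Default engine (POSIX extended): the lookahead is expected to be rejected
# with an error, which is caught here and recorded as FALSE.
default_ok <- tryCatch({
  regexpr(pattern, string)
  TRUE
}, error = function(e) FALSE)

# PCRE engine (perl = TRUE): the same pattern should compile and run.
perl_ok <- tryCatch({
  regexpr(pattern, string, perl = TRUE)
  TRUE
}, error = function(e) FALSE)

print(c(default = default_ok, perl = perl_ok))
```

If the default call fails while the perl = TRUE call succeeds, the pattern itself is fine and only the choice of engine needs to change.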
The \r metacharacter matches carriage return characters.
For example, the regular expression "[A-Za-z]" matches any single uppercase or lowercase letter. Inside a character set, a hyphen indicates a range of characters, so [A-Z] will match any one capital letter.
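Lookahead assertions are not part of POSIX extended regular expressions, so you have to identify the pattern as a Perl regular expression in R. In the stringr version this answer was written for, that is done by wrapping the pattern in perl():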
str_extract(string, perl(pattern))
# [1] "MARKETING" "FINANCE" "OPERATIONS"
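For readers on a newer stringr: as far as I know, since stringr 1.0.0 the package is built on the ICU engine (via stringi), perl() has been replaced by regex(), and lookarounds are accepted in a plain pattern. The sketch below assumes such a version is installed and skips itself otherwise; the variable names plain and wrapped are only for illustration.

```r
string <- c("MARKETING: Vice President", "FINANCE: Accountant I",
            "OPERATIONS: Plant Manager")
pattern <- "[A-Z]+(?=:)"

# Assumes stringr >= 1.0.0; the block is skipped if stringr is not installed.
if (requireNamespace("stringr", quietly = TRUE)) {
  # A plain character pattern is treated as a regular expression.
  plain <- stringr::str_extract(string, pattern)

  # regex() is the explicit wrapper that took over from perl().
  wrapped <- stringr::str_extract(string, stringr::regex(pattern))

  print(plain)
  print(identical(plain, wrapped))
}
```

If both calls return the three department names, no wrapper is needed for lookarounds in your installed version.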
You can also do this easily in base R:
regmatches(string, regexpr(pattern, string, perl=TRUE))
# [1] "MARKETING" "FINANCE" "OPERATIONS"
regexpr finds the matches, and regmatches uses the match data to extract the substrings.
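To see what that match data looks like, here is a small base R sketch. The fourth input element ("no department") is a made-up example added to show one thing worth knowing: regmatches drops elements that did not match, so the result can be shorter than the input.

```r
string <- c("MARKETING: Vice President", "FINANCE: Accountant I",
            "OPERATIONS: Plant Manager", "no department")  # last one is hypothetical

m <- regexpr("[A-Z]+(?=:)", string, perl = TRUE)

# Start position of the first match in each element (-1 means no match).
starts <- as.integer(m)

# Length of each match, stored as an attribute on the result.
lengths <- attr(m, "match.length")

# regmatches() uses both pieces of information to cut out the substrings.
departments <- regmatches(string, m)

print(starts)
print(lengths)
print(departments)
```

If you need a result that lines up with the input, one option is to fill a vector of NA values and assign the matches to the positions where starts is not -1.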