
Lookaround lookbefore regex for R

Tags: regex, r

I am trying to extract some text with regular expressions using the stringr package. For some reason, I'm getting an 'Invalid regexp' error. I have tried the expression in some online regex testers, and it works there. Is there something unique about how regex works in R, and in the stringr package in particular?

Here is an example:

library(stringr)

string <- c("MARKETING:  Vice President", "FINANCE:  Accountant I",
            "OPERATIONS: Plant Manager")

pattern <- "[A-Z]+(?=:)"
test <- gsub(" ","",string)
results <- str_extract(test, pattern)

This doesn't seem to be working. I would like to get "MARKETING", "FINANCE", and "OPERATIONS" without the ":" in them, which is why I'm using the lookahead syntax. I realize that I can work around this using:

pattern <- "[A-Z]+(:)"
test <- gsub(" ","",string)
results <- gsub(":","",str_extract(test, pattern))

But I anticipate that I might need to use lookarounds for more complex situations than this in the near future.

Do I need to amend the regex with some escapes or something to make this work?

exl asked Jan 03 '13 15:01


1 Answer

R's default regex engine does not support lookahead assertions. To use them, you have to identify the pattern as a Perl regular expression.

str_extract(string, perl(pattern))
# [1] "MARKETING"  "FINANCE"    "OPERATIONS"
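A note for later stringr releases: from version 1.0.0 onward, stringr is built on stringi and the ICU regex engine, which supports lookarounds by default, and perl() was deprecated in favour of regex(). The following is only a sketch under that assumption (stringr 1.0.0 or newer installed); the requireNamespace() guard and the variable names results and results_explicit are illustrative, not part of the original answer.

```r
string <- c("MARKETING:  Vice President", "FINANCE:  Accountant I",
            "OPERATIONS: Plant Manager")

# Assumption: stringr >= 1.0.0 (ICU engine via stringi) is installed.
if (requireNamespace("stringr", quietly = TRUE)) {
  # A plain character pattern is interpreted as an ICU regular expression,
  # so the lookahead needs no extra wrapper.
  results <- stringr::str_extract(string, "[A-Z]+(?=:)")

  # regex() is the explicit modifier that replaced perl().
  results_explicit <- stringr::str_extract(string,
                                           stringr::regex("[A-Z]+(?=:)"))

  print(results)
}
```

On older stringr versions that still rely on base R's engine, the perl() wrapper shown above remains the way to enable lookarounds.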

You can also do this easily in base R:

regmatches(string, regexpr(pattern, string, perl=TRUE))
# [1] "MARKETING"  "FINANCE"    "OPERATIONS"

regexpr finds the first match in each element, and regmatches uses that match data to extract the substrings.
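To make the match data concrete, here is a small base R sketch that inspects what regexpr returns and then applies the same approach with a lookbehind, since the question anticipates other lookarounds. The variable names (m, starts, match_lengths, departments, titles, default_engine) and the trimws() cleanup step are my own additions for illustration.

```r
string <- c("MARKETING:  Vice President", "FINANCE:  Accountant I",
            "OPERATIONS: Plant Manager")

# regexpr() returns the start position of the first match in each element;
# the match lengths are stored in the "match.length" attribute.
m <- regexpr("[A-Z]+(?=:)", string, perl = TRUE)
starts        <- as.integer(m)            # 1 1 1
match_lengths <- attr(m, "match.length")  # 9 7 10

# regmatches() uses those positions and lengths to cut out the substrings.
departments <- regmatches(string, m)

# Lookbehind: match everything that follows the colon, then trim the
# leading spaces left over from the original formatting.
titles <- trimws(regmatches(string,
                            regexpr("(?<=:).*", string, perl = TRUE)))

# Without perl = TRUE the default engine is expected to reject the
# lookahead syntax; tryCatch() captures the error message instead of
# stopping the script.
default_engine <- tryCatch(regexpr("[A-Z]+(?=:)", string),
                           error = function(e) conditionMessage(e))

print(departments)
print(titles)
```

Note that PCRE requires lookbehind patterns to have a fixed length, which is why the sketch looks behind for the single ":" and trims the whitespace afterwards rather than putting a variable number of spaces inside the lookbehind.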

Matthew Plourde answered Sep 21 '22 00:09