Lookaround lookbefore regex for R

Tags:

regex

r

I am trying to use regular expressions with the stringr package to extract some text. For some reason, I'm getting an 'Invalid regexp' error. I have tried the regex in some online test tools, and it seems to work there. I was wondering if there is something unique about how regex works in R, and particularly in the stringr package.

Here is an example:

library(stringr)

string <- c("MARKETING:  Vice President", "FINANCE:  Accountant I",
            "OPERATIONS: Plant Manager")

pattern <- "[A-Z]+(?=:)"
test <- gsub(" ","",string)
results <- str_extract(test, pattern)

This doesn't seem to be working. I would like to get "MARKETING", "FINANCE", and "OPERATIONS" without the ":" in them. That is why I'm using the lookahead syntax. I realize that I can work around this using:

pattern <- "[A-Z]+(:)"
test <- gsub(" ","",string)
results <- gsub(":","",str_extract(test, pattern))

But I anticipate that I might need to use lookarounds for more complex situations than this in the near future.

Do I need to amend the regex with some escapes or something to make this work?

509

asked Jan 03 '13 15:01

exl

1 Answer

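R's default regex engine follows POSIX extended syntax, which does not support lookarounds, hence the 'Invalid regexp' error. Lookahead assertions require you to identify the pattern as a Perl-compatible regular expression in R.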

str_extract(string, perl(pattern))
# [1] "MARKETING"  "FINANCE"    "OPERATIONS"
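A note for newer setups: as far as I know, stringr 1.0.0 and later is built on stringi and the ICU regex engine, where lookarounds are supported by default and the perl() wrapper is deprecated. The following is a minimal sketch under that assumption (a recent stringr installed), passing the pattern as a plain string:

```r
library(stringr)

string <- c("MARKETING:  Vice President", "FINANCE:  Accountant I",
            "OPERATIONS: Plant Manager")

# Lookahead: a run of uppercase letters immediately followed by a colon.
# The colon itself is not part of the match.
pattern <- "[A-Z]+(?=:)"

# Assumption: stringr >= 1.0.0, where the pattern is interpreted by ICU
# and no perl() wrapper is needed.
results <- str_extract(string, pattern)
print(results)
```

str_extract returns NA for elements with no match, so the result always has the same length as the input.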

You can also do this easily in base R:

regmatches(string, regexpr(pattern, string, perl=TRUE))
# [1] "MARKETING"  "FINANCE"    "OPERATIONS"

regexpr finds the matches, and regmatches uses the match data to extract the substrings.
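Since the question anticipates other lookarounds, here is a small base R sketch of the same regexpr/regmatches approach with both a lookahead and a lookbehind. The variable names dept and title are only illustrative, and trimming the leading spaces with trimws is one possible choice, not the only one:

```r
string <- c("MARKETING:  Vice President", "FINANCE:  Accountant I",
            "OPERATIONS: Plant Manager")

# Lookahead: uppercase letters directly before the colon.
dept <- regmatches(string, regexpr("[A-Z]+(?=:)", string, perl = TRUE))

# Lookbehind: everything after the colon. PCRE lookbehinds must have a
# fixed length, so only the colon is asserted here and the variable
# number of spaces that follows is removed afterwards with trimws().
title <- trimws(regmatches(string, regexpr("(?<=:).+", string, perl = TRUE)))

print(dept)
print(title)
```

One thing to keep in mind: regmatches with regexpr returns only the substrings that matched, so if some elements have no match the result is shorter than the input.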

181

answered Sep 21 '22 00:09

Matthew Plourde
