Regular expression that both includes and excludes certain strings in R

Tags:

regex

r

I am trying to use R to parse through a number of entries. I have two requirements for the entries I want back: I want all the entries that contain the word apple but don't contain the word orange.

For example:

1. I like apples
2. I really like apples
3. I like apples and oranges

I want to get entries 1 and 2 back.

How could I go about using R to do this?

Thanks.


asked May 29 '14 21:05

janovak

3 Answers

This regex is a bit smaller and much faster than the other regex versions (see comparison below). I don't have the tools to compare to David's double grepl so if someone can compare the single grep below vs the double grepl we'll be able to know. The comparison must be done both for a success case and a failure case.

^(?!.*orange).*apple.*$

1. The negative lookahead ensures we don't have orange.
2. We just match the string, so long as it contains apple. No need for a lookahead there.

Code Sample

# subject: character vector holding the entries to filter
grep("^(?!.*orange).*apple.*$", subject, perl = TRUE, value = TRUE)

Speed Comparison

@hwnd has now removed that double lookahead version, but according to RegexBuddy the speed difference remains:

1. Against I like apples and oranges, the engine takes 22 steps to fail, vs. 143 for the double lookahead version ^(?=.*apple)((?!orange).)*$ and 22 steps for ^((?!.*orange).)*apple.*$ (equal there but wait for point 2).
2. Against I really like apples, the engine takes 64 steps to succeed, vs. 104 for the double lookahead version ^(?=.*apple)((?!orange).)*$ and 538 steps for ^((?!.*orange).)*apple.*$.

These numbers were provided by the RegexBuddy debugger.
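
The comparison this answer asks for can be sketched in base R with system.time. This is only a rough timing harness, not a rigorous benchmark: the test vector subject, its size, and the repetition count are assumptions made for illustration, and timings will vary by machine and R version.

```r
# Hypothetical test data: success and failure cases, repeated many times
subject <- rep(c("I like apples",
                 "I really like apples",
                 "I like apples and oranges"), times = 10000)

# Single grep with a negative lookahead (needs perl = TRUE)
single_grep <- function(x) {
  grep("^(?!.*orange).*apple.*$", x, perl = TRUE, value = TRUE)
}

# Two plain grepl calls combined with logical operators
double_grepl <- function(x) {
  x[grepl("apple", x) & !grepl("orange", x)]
}

# Both approaches should return the same entries
stopifnot(identical(single_grep(subject), double_grepl(subject)))

# Rough timings over 20 repetitions each
print(system.time(for (i in 1:20) single_grep(subject)))
print(system.time(for (i in 1:20) double_grepl(subject)))
```

Regex step counts and wall-clock time are different measures, so the two approaches may rank differently here than in the step counts above.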


answered Sep 21 '22 03:09

zx81

You could combine two grepl calls:

temp <- c("I like apples", "I really like apples", "I like apples and oranges")
temp[grepl("apple", temp) & !grepl("orange", temp)]

## [1] "I like apples"      "I really like apples"
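
The same idea can be wrapped in a small helper. This is a sketch rather than part of the original answer: the function name filter_entries and its arguments are hypothetical, and fixed = TRUE assumes both terms are literal strings rather than regular expressions.

```r
# Keep entries that contain `include` and do not contain `exclude`.
# Both terms are treated as literal text (fixed = TRUE), not as regexes.
filter_entries <- function(x, include, exclude) {
  has_include <- grepl(include, x, fixed = TRUE)
  has_exclude <- grepl(exclude, x, fixed = TRUE)
  x[has_include & !has_exclude]
}

temp <- c("I like apples", "I really like apples", "I like apples and oranges")
result <- filter_entries(temp, "apple", "orange")
print(result)
## [1] "I like apples"        "I really like apples"
```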

answered Sep 21 '22 03:09

David Arenburg

Using a regular expression, you could do the following.

x <- c('I like apples', 'I really like apples', 
       'I like apples and oranges', 'I like oranges and apples',
       'I really like oranges and apples but oranges more')

x[grepl('^((?!.*orange).)*apple.*$', x, perl=TRUE)]
# [1] "I like apples"        "I really like apples"

The group ((?!.*orange).)* is tried one character at a time: before each character, the negative lookahead (?!.*orange) checks that the substring orange does not occur anywhere later on the line, and only then does the dot . consume that character (any character except a line break). The group is repeated 0 or more times. Next the pattern matches apple, followed by any character except a line break (0 or more times). Finally, the start and end of line anchors make sure the whole input is consumed.
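
To see how this pattern behaves, it can be run next to the single-lookahead form ^(?!.*orange).*apple.*$ on a few extra inputs. The vector y and the column names below are made up for illustration. One case is worth checking: when a line begins with apple, the repeated group can match zero times, so its lookahead never has to succeed.

```r
# Hypothetical inputs, including a line that starts with "apple"
y <- c("I like apples", "apples and oranges", "oranges and apples")

# Lookahead inside the repeated group (pattern from this answer)
per_char <- grepl('^((?!.*orange).)*apple.*$', y, perl = TRUE)

# Lookahead applied once at the start of the line
up_front <- grepl('^(?!.*orange).*apple.*$', y, perl = TRUE)

print(data.frame(entry = y, per_char = per_char, up_front = up_front))
```

With these inputs the two patterns should disagree only on "apples and oranges", which the per-character form accepts because apple matches at position 0 before any lookahead is enforced.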

UPDATE: If performance is an issue, you could use the following. Note that it only rules out orange; it relies on every entry in x already containing apple.

x[grepl('^(?!.*orange).*$', x, perl=TRUE)]
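
A quick check on input where one entry lacks apple (the vector z is hypothetical) shows the difference between filtering only on the absence of orange and also requiring apple after the lookahead.

```r
# Hypothetical inputs: the second entry contains neither word
z <- c("I like apples", "I like bananas", "I like apples and oranges")

# Excludes "orange" only
no_orange <- z[grepl('^(?!.*orange).*$', z, perl = TRUE)]
print(no_orange)
# [1] "I like apples"  "I like bananas"

# Excludes "orange" and requires "apple"
apple_no_orange <- z[grepl('^(?!.*orange).*apple', z, perl = TRUE)]
print(apple_no_orange)
# [1] "I like apples"
```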

answered Sep 22 '22 03:09

hwnd
