
How do lookahead and lookbehind support the concept of zero-width assertions in Ruby regex?

Q: What is look ahead in regex?

Lookahead is an assertion, available in Python regular expressions and in Ruby's alike, that succeeds or fails depending on whether a pattern appears ahead of, i.e. to the right of, the parser's current position. It does not consume any characters, which is why it is called a zero-width assertion.

Q: What is lookbehind in regex?

Lookbehind is an assertion, available in Python's re module and in Ruby alike, that succeeds or fails depending on whether a pattern appears behind, i.e. to the left of, the parser's current position. Like lookahead, it does not consume any characters.

Q: What is a zero-width assertion?

A zero-width (or zero-length) assertion is a part of a regular expression that tests a condition at the current position without consuming any input, so a successful test does not change the current position in the input string.

Q: Can I use look behind regex?

The good news is that you can use lookbehind anywhere in the regex, not only at the start. If you want to find a word not ending with an “s”, you could use \b\w+(?<!s)\b.
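
As a quick check of a lookbehind placed at the end of a pattern, here is a minimal Ruby sketch. The sample sentence is made up for illustration.

```ruby
# Match a whole word, then look back from its end and require that
# the character just consumed was not "s".
words = "cats dog birds fish".scan(/\b\w+(?<!s)\b/)
p words  # prints ["dog", "fish"]
```

Backtracking doesn't rescue "cats": giving back the final "s" leaves the cursor between "t" and "s", which is not a word boundary.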

Tags:

regex

ruby

ruby-1.9.3

I've just gone through the concept of zero-width assertions in the documentation, and a few quick questions came to mind:

Why the name "zero-width assertions"?
How do lookahead and lookbehind fit into the zero-width assertion concept?
What do the four constructs (?=s), (?!s), (?<=s) and (?<!s) tell the engine to do inside a pattern? Can you help me understand what is actually going on?

I also tried a few small snippets to understand the logic, but I'm not confident I understand their output:

irb(main):001:0> "foresight".sub(/(?!s)ight/, 'ee')
=> "foresee"
irb(main):002:0> "foresight".sub(/(?=s)ight/, 'ee')
=> "foresight"
irb(main):003:0> "foresight".sub(/(?<=s)ight/, 'ee')
=> "foresee"
irb(main):004:0> "foresight".sub(/(?<!s)ight/, 'ee')
=> "foresight"

Can anyone help me understand this?

EDIT

Here I have tried two snippets, one that uses a zero-width assertion:

irb(main):002:0> "foresight".sub(/(?!s)ight/, 'ee')
=> "foresee"

and one that doesn't:

irb(main):003:0> "foresight".sub(/ight/, 'ee')
=> "foresee"

Both produce the same output. How does each regexp move through the string internally to produce it? Could you help me visualize that?

Thanks


asked Jan 17 '13 20:01

Arup Rakshit

1 Answer

Regular expressions match from left to right, and move a sort of "cursor" along the string as they go. If your regex contains a regular character like a, this means: "if there's a letter a in front of the cursor, move the cursor ahead one character, and keep going. Otherwise, something's wrong; back up and try something else." So you might say that a has a "width" of one character.

A "zero-width assertion" is just that: it asserts something about the string (i.e., doesn't match if some condition doesn't hold), but it doesn't move the cursor forwards, because its "width" is zero.

You're probably already familiar with some simpler zero-width assertions, like ^ and $. These match the start and end of a line (in Ruby, \A and \z are the ones that anchor to the whole string). If the cursor isn't at such a position when it sees those symbols, the regex engine will fail, back up, and try something else. But they don't actually move the cursor forwards, because they don't match characters; they only check where the cursor is.

Lookahead and lookbehind work the same way. When the regex engine tries to match them, it checks around the cursor to see if the right pattern is ahead of or behind it, but in case of a match, it doesn't move the cursor.
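
To see the "zero width" directly, here is a small sketch (the variable names are only for illustration) that matches nothing but an assertion and inspects what was consumed:

```ruby
# Each pattern consists of a single zero-width assertion, so a successful
# match consumes nothing: the matched text is empty and the match
# begins and ends at the same index.
anchor     = "foo".match(/^/)
lookahead  = "foo".match(/(?=foo)/)
lookbehind = "foo".match(/(?<=fo)/)

p anchor[0]                                  # prints ""
p lookahead[0]                               # prints ""
p [lookahead.begin(0), lookahead.end(0)]     # prints [0, 0]
p [lookbehind.begin(0), lookbehind.end(0)]   # prints [2, 2]
```

The lookbehind only succeeds once the cursor has "fo" behind it, which is why its empty match sits at index 2.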

Consider:

/(?=foo)foo/.match 'foo'

This will match! The regex engine goes like this:

1. Start at the beginning of the string: |foo.
2. The first part of the regex is (?=foo). This means: only match if foo appears after the cursor. Does it? Well, yes, so we can proceed. But the cursor doesn't move, because this is zero-width. We still have |foo.
3. Next is f. Is there an f in front of the cursor? Yes, so proceed, and move the cursor past the f: f|oo.
4. Next is o. Is there an o in front of the cursor? Yes, so proceed, and move the cursor past the o: fo|o.
5. Same thing again, bringing us to foo|.
6. We reached the end of the regex, and nothing failed, so the pattern matches.
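
The walkthrough can be checked from the MatchData, roughly like this. The second pair of inputs ("foo fob" and "fob") is made up to show that the lookahead can inspect more than the pattern ends up consuming.

```ruby
m = /(?=foo)foo/.match('foo')
p m[0]        # prints "foo"  (only the literal foo was consumed)
p m.begin(0)  # prints 0      (the cursor started at |foo)
p m.end(0)    # prints 3      (and finished at foo|)

# Here only "fo" is consumed, but "foo" must still be ahead of the cursor.
partial = "foo fob"[/(?=foo)fo/]
none    = "fob"[/(?=foo)fo/]
p partial  # prints "fo"
p none     # prints nil
```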

On your four assertions in particular:

(?=...) is "lookahead"; it asserts that ... does appear after the cursor.
```
1.9.3p125 :002 > 'jump june'.gsub(/ju(?=m)/, 'slu')
 => "slump june" 
```
The "ju" in "jump" matches because an "m" comes next. But the "ju" in "june" doesn't have an "m" next, so it's left alone. The "m" itself is only looked at, not matched, so it isn't replaced.

Since it doesn't move the cursor, you have to be careful when putting anything after it. (?=a)b will never match anything, because it checks that the next character is a, then also checks that the same character is b, which is impossible.

(?<=...) is "lookbehind"; it asserts that ... does appear before the cursor.
```
1.9.3p125 :002 > 'four flour'.gsub(/(?<=f)our/, 'ive')
 => "five flour" 
```
The "our" in "four" matches because there's an "f" immediately before it, but the "our" in "flour" has an "l" immediately before it, so it doesn't match.

Like above, you have to be careful with what you put before it. a(?<=b) will never match, because it checks that the next character is a, moves the cursor, then checks that the previous character (the a it just matched) was b.

(?!...) is "negative lookahead"; it asserts that ... does not appear after the cursor.
```
1.9.3p125 :003 > 'child children'.gsub(/child(?!ren)/, 'kid')
 => "kid children"
```
"child" matches, because what comes next is a space, not "ren". The "child" in "children" doesn't match, because "ren" follows it.

This is probably the one I get the most use out of; finely controlling what can't come next comes in handy.

(?<!...) is "negative lookbehind"; it asserts that ... does not appear before the cursor.
```
1.9.3p125 :004 > 'foot root'.gsub(/(?<!r)oot/, 'eet')
 => "feet root" 
```
The "oot" in "foot" is fine, since there's no "r" before it. The "oot" in "root" clearly has an "r" before it, so it's left alone.

As an additional restriction, most regex engines require that the ... inside a lookbehind (positive or negative) has a fixed length. So you generally can't use ?, +, *, or {n,m} there.
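
The two "be careful" cases and the fixed-length restriction can be tried out as below. This is a sketch: `compiles?` is a hypothetical helper written for this example, and the rejection of `(?<=a+)` reflects how Ruby's regex engine behaves as far as I know, so other engines may differ.

```ruby
# Contradictory neighbours: the assertion and the adjacent literal talk
# about the same character, so these can never match.
never_ahead  = "ab"[/(?=a)b/]
never_behind = "ab"[/a(?<=b)/]
p never_ahead   # prints nil
p never_behind  # prints nil

# Consistent versions of the same idea do match.
ok_ahead  = "ab"[/(?=a)\w/]
ok_behind = "ab"[/\w(?<=b)/]
p ok_ahead   # prints "a"
p ok_behind  # prints "b"

# Hypothetical helper: true if Ruby accepts the pattern source,
# false if compiling it raises RegexpError.
def compiles?(source)
  Regexp.new(source)
  true
rescue RegexpError
  false
end

fixed    = compiles?('(?<=ab)c')  # fixed-length lookbehind
variable = compiles?('(?<=a+)c')  # variable-length lookbehind
p fixed     # prints true
p variable  # prints false
```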

You can also nest these and otherwise do all kinds of crazy things. I use them mainly for one-offs I know I'll never have to maintain, so I don't have any great examples of real-world applications handy; honestly, they're weird enough that you should try to do what you want some other way first. :)
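
For what it's worth, one fairly common practical use combines a lookbehind with a lookahead to insert thousands separators. This is only a sketch: `with_commas` is a hypothetical helper, and it assumes the input is a plain string of digits.

```ruby
# Match every position that has a digit behind it and a whole number of
# three-digit groups ahead of it, up to the end of the string.
# Both assertions are zero-width, so each match is empty: gsub removes
# nothing and only inserts the comma.
def with_commas(digits)
  digits.gsub(/(?<=\d)(?=(?:\d{3})+\z)/, ',')
end

p with_commas("1234567")  # prints "1,234,567"
p with_commas("999")      # prints "999"
```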

Afterthought: The syntax comes from Perl regular expressions, which used (? followed by various symbols for a lot of extended syntax, because a ? directly after an opening parenthesis was otherwise invalid (there's nothing for it to quantify). So <= doesn't mean anything by itself; (?<= is one entire token, meaning "this is the start of a lookbehind". It's like how += and ++ are separate operators, even though they both start with +.

They're easy to remember, though: = indicates looking forwards (or, really, "here"), < indicates looking backwards, and ! has its traditional meaning of "not".

Regarding your later examples:

irb(main):002:0> "foresight".sub(/(?!s)ight/, 'ee')
=> "foresee"

irb(main):003:0> "foresight".sub(/ight/, 'ee')
=> "foresee"

Yes, these produce the same output. This is that tricky bit with using lookahead:

1. The regex engine has tried some things, but they haven't worked, and now it's at fores|ight.
2. It checks (?!s). Is the character after the cursor s? No, it's i! So that part matches and the matching continues, but the cursor doesn't move, and we still have fores|ight.
3. It checks ight. Does ight come after the cursor? Well, yes, it does, so move the cursor: foresight|.
4. We're done!

The cursor moved over the substring ight, so that's the full match, and that's what gets replaced.
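
You can confirm that both patterns consume exactly the same text by comparing their MatchData (variable names here are just for illustration):

```ruby
with_assertion = "foresight".match(/(?!s)ight/)
plain          = "foresight".match(/ight/)

# Text to the left of the match, then the match itself.
p with_assertion.pre_match  # prints "fores"
p with_assertion[0]         # prints "ight"
p plain.pre_match           # prints "fores"
p plain[0]                  # prints "ight"
```

The lookahead contributes nothing to the match itself, so `sub` replaces the same four characters in both cases.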

Doing (?!a)b is useless, since you're saying: the next character must not be a, and it must be b. But that's the same as just matching b!

A negative lookahead directly in front of a pattern can be useful sometimes, but only when that pattern is broader than a single literal: for example, (?!3)\d will match any digit that isn't a 3.
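
A short sketch of both cases, using a made-up digit string:

```ruby
# (?!a)b behaves exactly like b: a character that is b is never a.
redundant = "b"[/(?!a)b/]
p redundant  # prints "b"

# (?!3)\d carves one value out of a broader class: any digit except 3.
not_three = "1234535".scan(/(?!3)\d/)
p not_three  # prints ["1", "2", "4", "5", "5"]
```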

This is probably what you wanted, a negative lookbehind:

1.9.3p125 :001 > "foresight".sub(/(?<!s)ight/, 'ee')
 => "foresight"

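This asserts that s doesn't come before ight. In "foresight" an s does come right before ight, so the pattern doesn't match and the string is returned unchanged.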
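
Putting the four patterns from the question side by side, with the reason for each result in the comments (a sketch; "night" is an extra input added for contrast):

```ruby
word = "foresight"

# At fores|ight the next character is "i" and the previous one is "s".
ahead      = word.sub(/(?=s)ight/, 'ee')   # needs "s" next, but "i" is next: no match
not_ahead  = word.sub(/(?!s)ight/, 'ee')   # "s" must not be next, and it isn't: match
behind     = word.sub(/(?<=s)ight/, 'ee')  # needs "s" before, and it is there: match
not_behind = word.sub(/(?<!s)ight/, 'ee')  # "s" must not be before, but it is: no match

p ahead       # prints "foresight"
p not_ahead   # prints "foresee"
p behind      # prints "foresee"
p not_behind  # prints "foresight"

# With no "s" before "ight", the negative lookbehind lets the match through.
contrast = "night".sub(/(?<!s)ight/, 'ee')
p contrast    # prints "nee"
```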

answered Sep 30 '22 17:09

Eevee
