I just went through the concept of zero-width assertions in the documentation, and a few quick questions came to mind:
What are zero-width assertions?
How do the look-ahead and look-behind concepts support zero-width assertions?
What do the four symbols ?<=s, ?<!s, ?=s, and ?!s instruct inside the pattern?
Can you help me understand what is actually going on? I also tried some tiny snippets to understand the logic, but I'm not very confident about their output:
irb(main):001:0> "foresight".sub(/(?!s)ight/, 'ee')
=> "foresee"
irb(main):002:0> "foresight".sub(/(?=s)ight/, 'ee')
=> "foresight"
irb(main):003:0> "foresight".sub(/(?<=s)ight/, 'ee')
=> "foresee"
irb(main):004:0> "foresight".sub(/(?<!s)ight/, 'ee')
=> "foresight"
Can anyone help me here to understand?
EDIT
Here I have tried two snippets, one using a zero-width assertion, as below:
irb(main):002:0> "foresight".sub(/(?!s)ight/, 'ee')
=> "foresee"
and the other without a zero-width assertion, as below:
irb(main):003:0> "foresight".sub(/ight/, 'ee')
=> "foresee"
Both of the above produce the same output. How does each regexp move through the string internally to produce that output? Could you help me visualize it?
Thanks
Lookahead is used as an assertion in Python regular expressions to determine success or failure depending on whether the pattern is ahead, i.e. to the right of the parser's current position. Lookaheads don't consume any characters, which is why they are called zero-width assertions.
Lookbehind is used as an assertion in Python regular expressions (re) to determine success or failure depending on whether the pattern is behind, i.e. to the left of the parser's current position. Lookbehinds don't consume any characters either.
Zero-width or zero-length assertion in regular expressions means that there is a zero-length match that does not change the current position of the pointer in the input string.
The good news is that you can use lookbehind anywhere in the regex, not only at the start. If you want to find a word not ending with an “s”, you could use \b\w+(?<!s)\b.
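As a rough illustration of that last pattern in Ruby, here is a minimal sketch. The helper name words_not_ending_in_s and the sample sentence are made up for this example.

```ruby
# Hypothetical helper: collect the words that do not end in "s".
def words_not_ending_in_s(text)
  # \w+ consumes a whole word; (?<!s) then checks, without consuming
  # anything, that the character just before the cursor is not "s".
  text.scan(/\b\w+(?<!s)\b/)
end

p words_not_ending_in_s('cats dog birds fish')  # prints ["dog", "fish"]
```

The lookbehind sits at the end of the pattern here, which shows that it does not have to come first.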
Regular expressions match from left to right, and move a sort of "cursor" along the string as they go. If your regex contains a regular character like a, this means: "if there's a letter a in front of the cursor, move the cursor ahead one character, and keep going. Otherwise, something's wrong; back up and try something else." So you might say that a has a "width" of one character.
A "zero-width assertion" is just that: it asserts something about the string (i.e., doesn't match if some condition doesn't hold), but it doesn't move the cursor forwards, because its "width" is zero.
You're probably already familiar with some simpler zero-width assertions, like ^ and $. These match the start and end of a line (in Ruby, \A and \z anchor the whole string instead). If the cursor isn't at the start or end when it sees those symbols, the regex engine will fail, back up, and try something else. But they don't actually move the cursor forwards, because they don't match characters; they only check where the cursor is.
Lookahead and lookbehind work the same way. When the regex engine tries to match them, it checks around the cursor to see if the right pattern is ahead of or behind it, but in case of a match, it doesn't move the cursor.
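To see the "zero width" directly, one option is to match a lookahead with nothing after it and inspect the result. This is only a sketch; the helper name lookahead_only_match and the sample string are invented for illustration.

```ruby
# Hypothetical helper: match a bare lookahead and return the MatchData.
def lookahead_only_match(text)
  text.match(/(?=foo)/)
end

m = lookahead_only_match('say foo')
p m[0]         # the matched text is the empty string: ""
p m.offset(0)  # start and end are the same position: [4, 4]
```

The assertion succeeds just before "foo", but the match itself contains no characters.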
Consider:
/(?=foo)foo/.match 'foo'
This will match! The regex engine goes like this:
1. The cursor starts at the beginning of the string: |foo.
2. The first part of the regex is (?=foo). This means: only match if foo appears after the cursor. Does it? Well, yes, so we can proceed. But the cursor doesn't move, because this is zero-width. We still have |foo.
3. Next is f. Is there an f in front of the cursor? Yes, so proceed, and move the cursor past the f: f|oo.
4. Next is o. Is there an o in front of the cursor? Yes, so proceed, and move the cursor past the o: fo|o.
5. The final o works the same way, leaving foo|. The regex is used up and nothing failed, so the pattern matches.
On your four assertions in particular:
(?=...) is "lookahead"; it asserts that ... does appear after the cursor.
1.9.3p125 :002 > 'jump june'.gsub(/ju(?=m)/, 'slu')
=> "slump june"
The "ju" in "jump" matches because an "m" comes next. But the "ju" in "june" doesn't have an "m" next, so it's left alone.
Since it doesn't move the cursor, you have to be careful when putting anything after it. (?=a)b will never match anything, because it checks that the next character is a, then also checks that the same character is b, which is impossible.
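Both points about lookahead can be checked with a short script. This is a sketch only; the method name slump is made up for the example.

```ruby
# The lookahead requires "m" right after "ju" but leaves it out of the
# match, so only "ju" is replaced.
def slump(text)
  text.gsub(/ju(?=m)/, 'slu')
end

p slump('jump june')       # prints "slump june"

# (?=a)b demands that the same next character be both "a" and "b".
p(/(?=a)b/.match?('ab'))   # prints false
p(/(?=a)b/.match?('b'))    # prints false
```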
(?<=...) is "lookbehind"; it asserts that ... does appear before the cursor.
1.9.3p125 :002 > 'four flour'.gsub(/(?<=f)our/, 'ive')
=> "five flour"
The "our" in "four" matches because there's an "f" immediately before it, but the "our" in "flour" has an "l" immediately before it so it doesn't match.
Like above, you have to be careful with what you put before it. a(?<=b) will never match, because it checks that the next character is a, moves the cursor, then checks that the previous character was b, even though that previous character is the a it just matched.
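The same kind of check works for lookbehind. Again a sketch, with the method name five invented for illustration.

```ruby
# The lookbehind requires "f" immediately before "our" but does not
# include it in the match, so the "f" survives the substitution.
def five(text)
  text.gsub(/(?<=f)our/, 'ive')
end

p five('four flour')        # prints "five flour"

# a(?<=b) consumes an "a" and then requires that same character to be "b".
p(/a(?<=b)/.match?('ab'))   # prints false
p(/a(?<=b)/.match?('ba'))   # prints false
```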
(?!...) is "negative lookahead"; it asserts that ... does not appear after the cursor.
1.9.3p125 :003 > 'child children'.gsub(/child(?!ren)/, 'kid')
=> "kid children"
"child" matches, because what comes next is a space, not "ren". "children" doesn't.
This is probably the one I get the most use out of; finely controlling what can't come next comes in handy.
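For contrast, here is a small sketch comparing the negative lookahead with the plain pattern. The method names kid_unless_ren and kid_everywhere are hypothetical.

```ruby
# Replace "child" only when it is not followed by "ren".
def kid_unless_ren(text)
  text.gsub(/child(?!ren)/, 'kid')
end

# Without the assertion, every "child" is replaced.
def kid_everywhere(text)
  text.gsub(/child/, 'kid')
end

p kid_unless_ren('child children')  # prints "kid children"
p kid_everywhere('child children')  # prints "kid kidren"
```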
(?<!...) is "negative lookbehind"; it asserts that ... does not appear before the cursor.
1.9.3p125 :004 > 'foot root'.gsub(/(?<!r)oot/, 'eet')
=> "feet root"
The "oot" in "foot" is fine, since there's no "r" before it. The "oot" in "root" clearly has an "r".
As an additional restriction, most regex engines require that the ... inside a lookbehind (positive or negative) has a fixed length. So you can't use ?, +, *, or {n,m} there.
You can also nest these and otherwise do all kinds of crazy things. I use them mainly for one-offs I know I'll never have to maintain, so I don't have any great examples of real-world applications handy; honestly, they're weird enough that you should try to do what you want some other way first. :)
Afterthought: The syntax comes from Perl regular expressions, which used (? followed by various symbols for a lot of extended syntax because ? on its own is invalid. So <= doesn't mean anything by itself; (?<= is one entire token, meaning "this is the start of a lookbehind". It's like how += and ++ are separate operators, even though they both start with +.
They're easy to remember, though: = indicates looking forwards (or, really, "here"), < indicates looking backwards, and ! has its traditional meaning of "not".
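Putting the four tokens side by side on the word from the question may help. This is a sketch under the assumption that each assertion is followed by ight, as in the original snippets; the helper names four_assertions and substitute_with_each are made up.

```ruby
# The four assertion tokens, each followed by "ight".
def four_assertions
  {
    lookahead:           /(?=s)ight/,   # "s" must come next
    lookbehind:          /(?<=s)ight/,  # "s" must come just before
    negative_lookahead:  /(?!s)ight/,   # "s" must not come next
    negative_lookbehind: /(?<!s)ight/   # "s" must not come just before
  }
end

# Hypothetical helper: apply every pattern to the same word.
def substitute_with_each(word)
  four_assertions.transform_values { |pattern| word.sub(pattern, 'ee') }
end

substitute_with_each('foresight').each do |name, result|
  puts "#{name}: #{result}"
end
# prints:
# lookahead: foresight
# lookbehind: foresee
# negative_lookahead: foresee
# negative_lookbehind: foresight
```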
Regarding your later examples:
irb(main):002:0> "foresight".sub(/(?!s)ight/, 'ee')
=> "foresee"
irb(main):003:0> "foresight".sub(/ight/, 'ee')
=> "foresee"
Yes, these produce the same output. This is that tricky bit with using lookahead:
1. After failing at the earlier positions, the engine reaches fores|ight.
2. The first part of the regex is (?!s). Is the character after the cursor s? No, it's i! So that part matches and the matching continues, but the cursor doesn't move, and we still have fores|ight.
3. Next is ight. Does ight come after the cursor? Well, yes, it does, so move the cursor: foresight|.
4. The regex is used up. The cursor moved over the substring ight, so that's the full match, and that's what gets replaced.
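One way to confirm that both patterns end up with the same match is to compare what they matched and where. A minimal sketch; the helper name match_span is invented for this example.

```ruby
# Hypothetical helper: report the text before the match, the matched
# text, and the [start, end] offsets of the match.
def match_span(pattern, text)
  m = pattern.match(text)
  m && [m.pre_match, m[0], m.offset(0)]
end

p match_span(/(?!s)ight/, 'foresight')  # prints ["fores", "ight", [5, 9]]
p match_span(/ight/, 'foresight')       # prints ["fores", "ight", [5, 9]]
```

The lookahead adds a check at the cursor but contributes nothing to the matched text, so the two results are identical.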
Doing (?!a)b is useless, since you're saying: the next character must not be a, and it must be b. But that's the same as just matching b!
This can be useful sometimes, but you need a more complex pattern: for example, (?!3)\d will match any digit that isn't a 3.
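A quick sketch of that digit pattern, with the method name digits_except_three made up for illustration:

```ruby
# (?!3) rejects a 3 at the cursor; \d then consumes whatever digit is there.
def digits_except_three(text)
  text.scan(/(?!3)\d/)
end

p digits_except_three('1234 333')  # prints ["1", "2", "4"]
```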
This is what you want:
1.9.3p125 :001 > "foresight".sub(/(?<!s)ight/, 'ee')
=> "foresight"
This asserts that s doesn't come before ight.