I'd like to match a word, then get everything before it back to the nearest preceding period or the start of the string.
For example, given this string and searching for the word "regex":
s = 'Do not match this. Or this. Or this either. I like regex. It is hard, but regex is also rewarding.'
It should return:
>> I like regex.
>> It is hard, but regex is also rewarding.
I'm trying to get my head around look-aheads and look-behinds, but (it seems) you can't easily look back until you hit something, only if it's immediately next to your pattern. I can get pretty close with this:
pattern = re.compile(r'(?:(?<=\.)|(?<=^))(.*?regex.*?\.)')
But the first match starts at the very beginning of the string and runs through every earlier period up to "regex":
>> Do not match this. Or this. Or this either. I like regex. # no!
>> It is hard, but regex is also rewarding. # correct
The period (.) represents the wildcard character. Any character (except for the newline character) will be matched by a period in a regular expression; when you literally want a period in a regular expression you need to precede it with a backslash.
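A minimal sketch of the difference between the wildcard and the escaped period, using Python's `re` module. The sample text is made up for illustration.

```python
import re

# Hypothetical sample text: three candidates separated by spaces.
text = "abc a.c a\nc"

# Unescaped: "." matches any single character except a newline.
wildcard_hits = re.findall(r"a.c", text)

# Escaped: "\." matches only a literal period.
literal_hits = re.findall(r"a\.c", text)

print(wildcard_hits)  # ['abc', 'a.c']
print(literal_hits)   # ['a.c']
```

The last candidate, `a\nc`, is skipped by both patterns because the wildcard does not cross a newline unless `re.DOTALL` is set.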
The caret ^ and dollar $ characters have special meaning in a regexp. They are called “anchors”. The caret ^ matches at the beginning of the text, and the dollar $ – at the end. The pattern ^Mary means: “string start and then Mary”.
$ means "match the end of the string" (the position after the last character in the string). Used together, ^ and $ ensure that the entire string is matched instead of just a substring.
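To make the anchors concrete, here is a small sketch in Python. The sample strings are assumptions chosen for illustration.

```python
import re

# "^Mary" only succeeds when the string starts with "Mary".
starts_with_mary = re.search(r"^Mary", "Mary had a little lamb") is not None
mary_later = re.search(r"^Mary", "Hello Mary") is not None

# "lamb$" only succeeds when the string ends with "lamb".
ends_with_lamb = re.search(r"lamb$", "Mary had a little lamb") is not None

# Both anchors together: the whole string must consist of digits.
all_digits = re.search(r"^\d+$", "123") is not None
partly_digits = re.search(r"^\d+$", "123abc") is not None

print(starts_with_mary, mary_later)  # True False
print(ends_with_lamb)                # True
print(all_digits, partly_digits)     # True False
```

Note that in Python `$` also matches just before a trailing newline. If that matters, `re.fullmatch` is one way to require a match of the complete string.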
[] denotes a character class. () denotes a capturing group. [a-z0-9] matches one character that is in the range a-z OR 0-9. (a-z0-9) captures the literal text a-z0-9; inside a group the hyphens are ordinary characters, not ranges.
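A short sketch of the class-versus-group distinction in Python. The input strings are invented for the example.

```python
import re

# Character class: each match is ONE character from a-z or 0-9.
# The hyphens in the input are not matched.
class_hits = re.findall(r"[a-z0-9]", "a-z0-9")

# Capturing group: matches the literal seven-character text "a-z0-9".
group_hits = re.findall(r"(a-z0-9)", "x a-z0-9 y")

# The group does not match ordinary lowercase text.
group_on_letters = re.search(r"(a-z0-9)", "abc")

print(class_hits)        # ['a', 'z', '0', '9']
print(group_hits)        # ['a-z0-9']
print(group_on_letters)  # None
```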
You don't need to use lookarounds to do that. The negated character class is your best friend:
(?:[^\s.][^.]*)?regex[^.]*\.?
or
[^.]*regex[^.]*\.?
This way you match any characters before the word "regex" while forbidding any of them from being a period.
The first pattern strips leading whitespace; the second one is more basic.
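Here is a quick sketch running both patterns against the string from the question with `re.findall`. The variable names are only for illustration.

```python
import re

s = ('Do not match this. Or this. Or this either. '
     'I like regex. It is hard, but regex is also rewarding.')

# First pattern: the optional group must start on a character that is
# neither whitespace nor a period, so leading spaces are dropped.
trimmed = re.findall(r'(?:[^\s.][^.]*)?regex[^.]*\.?', s)

# Second pattern: simpler, but keeps the space that follows the previous period.
basic = re.findall(r'[^.]*regex[^.]*\.?', s)

print(trimmed)  # ['I like regex.', 'It is hard, but regex is also rewarding.']
print(basic)    # [' I like regex.', ' It is hard, but regex is also rewarding.']
```

With the second pattern you could also call `.strip()` on each result instead of handling the whitespace inside the regex.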
About your pattern:
Don't forget that a regex engine tries to succeed at each position of the string, from left to right. That's why something like (?:(?<=\.)|(?<=^)).*?regex
doesn't always return the shortest substring between a period (or the start of the string) and the word "regex", even if you use a non-greedy quantifier. The leftmost position always wins, and a non-greedy quantifier keeps taking characters until the next subpattern succeeds.
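The leftmost-wins behaviour can be seen with a small sketch on the string from the question. This is only an illustration, with made-up variable names.

```python
import re

s = ('Do not match this. Or this. Or this either. '
     'I like regex. It is hard, but regex is also rewarding.')

# The lookbehind succeeds at position 0 (start of string), so the engine
# commits to that start and the lazy ".*?" simply grows until "regex" appears.
lazy = re.search(r'(?:(?<=\.)|(?<=^)).*?regex', s)

# With a negated class the match cannot cross a period, so every start
# position before the last period preceding "regex" fails.
negated = re.search(r'[^.]*regex', s)

print(lazy.start())     # 0
print(lazy.group())     # Do not match this. Or this. Or this either. I like regex
print(negated.group())  # ' I like regex' (with a leading space)
```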
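As an aside, the negated character class can be useful once more:
to shorten (?:(?<=\.)|(?<=^))
you can write (?<![^.])
which reads "not preceded by a character other than a period", that is, preceded by a period or by nothing at all.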
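To check that the two assertions agree, here is a sketch that lists the positions where each zero-width pattern matches. The test string is invented, and it assumes Python 3.7+ and no `re.MULTILINE` flag.

```python
import re

# Hypothetical test string: a0 b1 .2 c3 d4 .5 (length 6)
text = 'ab.cd.'

# Long form: preceded by a period, or at the start of the string.
long_form = [m.start() for m in re.finditer(r'(?:(?<=\.)|(?<=^))', text)]

# Short form: NOT preceded by a character other than a period.
short_form = [m.start() for m in re.finditer(r'(?<![^.])', text)]

print(long_form)   # [0, 3, 6]
print(short_form)  # [0, 3, 6]
```

Both patterns match at the start of the string and right after each period, including the position at the very end.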