
Regex match back to a period or start of string

Tags: python, regex

I'd like to match a word, then get everything before it back to the nearest preceding period or the start of the string.

For example, given this string and searching for the word "regex":

s = 'Do not match this. Or this. Or this either. I like regex. It is hard, but regex is also rewarding.'

It should return:

>> I like regex.
>> It is hard, but regex is also rewarding.

I'm trying to get my head around look-aheads and look-behinds, but (it seems) you can't easily look back until you hit something, only if it's immediately next to your pattern. I can get pretty close with this:

pattern = re.compile(r'(?:(?<=\.)|(?<=^))(.*?regex.*?\.)')

But the first match starts at the beginning of the string and runs all the way through the first sentence containing "regex":

>> Do not match this. Or this. Or this either. I like regex.  # no!
>> It is hard, but regex is also rewarding.                   # correct
JeffThompson asked Jul 20 '17 00:07



1 Answer

You don't need to use lookarounds to do that. The negated character class is your best friend:

(?:[^\s.][^.]*)?regex[^.]*\.?

or

[^.]*regex[^.]*\.?

This way the pattern consumes any characters before and after the word "regex" while forbidding any of them from being a dot, so a match can never cross a sentence boundary.

The first pattern strips leading whitespace from the match; the second one is more basic and keeps it.
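
To see the difference between the two patterns, here is a minimal sketch that runs both against the string from the question with re.findall. The variable names (trimmed, basic) are only illustrative and are not part of the original answer.

```python
import re

s = ('Do not match this. Or this. Or this either. '
     'I like regex. It is hard, but regex is also rewarding.')

# First pattern: the optional group must begin with a character that is
# neither whitespace nor a dot, so the leading space is left out.
trimmed = re.findall(r'(?:[^\s.][^.]*)?regex[^.]*\.?', s)

# Second pattern: simpler, but it keeps the space after the previous dot.
basic = re.findall(r'[^.]*regex[^.]*\.?', s)

print(trimmed)  # ['I like regex.', 'It is hard, but regex is also rewarding.']
print(basic)    # [' I like regex.', ' It is hard, but regex is also rewarding.']
```

If you use the second pattern, calling .strip() on each result should give the same output as the first.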

About your pattern:

Don't forget that a regex engine tries to succeed at each position from the left to the right of the string. That's why something like (?:(?<=\.)|(?<=^)).*?regex doesn't always return the shortest substring between a dot or the start of the string and the word "regex", even if you use a non-greedy quantifier. The leftmost position always wins and a non-greedy quantifier takes characters until the next subpattern succeeds.
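
As a rough illustration of the "leftmost position wins" behaviour, the sketch below runs the pattern from the question against the same string. The names lazy and matches are just placeholders chosen for this example.

```python
import re

s = ('Do not match this. Or this. Or this either. '
     'I like regex. It is hard, but regex is also rewarding.')

# Pattern from the question. At position 0 the (?<=^) lookbehind succeeds,
# so the engine tries that start first; the lazy .*? then grows one
# character at a time until "regex" can match, crossing several dots.
lazy = re.compile(r'(?:(?<=\.)|(?<=^))(.*?regex.*?\.)')
matches = lazy.findall(s)

print(matches[0])  # Do not match this. Or this. Or this either. I like regex.
print(len(matches))  # 2
```

The non-greedy quantifier only makes the match as short as possible from the starting position that was already chosen; it does not move the start to the right.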

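As an aside, the negated character class is useful here once more: you can shorten (?:(?<=\.)|(?<=^)) to (?<![^.]), which reads as "not preceded by a character that is not a dot" and therefore succeeds right after a dot or at the start of the string.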
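
One way to check that the two lookbehind forms agree is to list every position where each zero-width pattern matches. This is only a sketch: the helper positions() is a hypothetical name invented for this example, and it assumes the default flags (with re.MULTILINE, ^ would also match after newlines and the two forms could differ).

```python
import re

s = ('Do not match this. Or this. Or this either. '
     'I like regex. It is hard, but regex is also rewarding.')

def positions(pattern, text):
    """Return the offsets at which a zero-width pattern matches."""
    return [m.start() for m in re.finditer(pattern, text)]

long_form = positions(r'(?:(?<=\.)|(?<=^))', s)
short_form = positions(r'(?<![^.])', s)

# Both forms should match at offset 0 and immediately after every dot.
print(long_form == short_form)  # True
```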

Casimir et Hippolyte answered Oct 20 '22 10:10
