
Regex: Capturing first occurrence before lookahead

Tags:

regex

I'm trying to capture the part of a URL that comes before a particular word. The only trouble is that the word could also be part of the domain.

Examples (I'm trying to capture everything before dinner):

https://breakfast.example.com/lunch/dinner/

https://breakfast.example.brunch.com:8080/lunch/dinner

http://dinnerdemo.example.com/dinner/

I am able to use:

^(.*://.*/)(?=dinner/?)

The trouble I am having is that the lookahead doesn't appear to be lazy enough, so the following fails:

https://breakfast.example.com/lunch/dinner/login.html?returnURL=https://breakfast.example.com/lunch/dinner/

as it captures:

https://breakfast.example.com/lunch/dinner/login.html?returnURL=https://breakfast.example.com/lunch/

I don't understand why this happens or how to fix my regex. Perhaps I'm on the wrong track, but how can I capture the right prefix for all of my examples?

Brandon asked Jun 25 '14 at 22:06


2 Answers

You can use some laziness:

^(.*?:\/\/).*?/(?=dinner/?)


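Because .* is greedy, the first .* in your regex consumed everything up to the last :// in the string (the one inside returnURL), and the second .* then ran on to the last / that is followed by dinner. That is where it found a match.

A greedy .* in the middle of a regex is, by the way, risky practice. On long strings it can cause heavy backtracking and poor performance. .*? is usually the better choice here, since it is reluctant rather than greedy and stops at the first position where the rest of the pattern can match.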
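To make the difference concrete, here is a minimal sketch in Python (an assumption, since the question does not name a regex engine) that runs the original greedy pattern and the lazy one against the failing URL. The variable names are illustrative only. Note that in the lazy pattern the capture group holds only the scheme, so the sketch reads the whole match with group(0).

```python
import re

# Original pattern from the question: both .* are greedy.
GREEDY = re.compile(r"^(.*://.*/)(?=dinner/?)")
# Lazy variant from this answer; "/" needs no escaping in Python.
LAZY = re.compile(r"^(.*?://).*?/(?=dinner/?)")

url = (
    "https://breakfast.example.com/lunch/dinner/login.html"
    "?returnURL=https://breakfast.example.com/lunch/dinner/"
)

# group(0) is the whole match, i.e. everything before the lookahead.
greedy_match = GREEDY.match(url).group(0)
lazy_match = LAZY.match(url).group(0)

print(greedy_match)  # runs on to the last "dinner" in the string
print(lazy_match)    # stops before the first "/dinner"
```

The greedy version stops just before the final dinner/ in the query string, while the lazy version stops before the first one in the path.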

Ray Toal answered Sep 29 '22 at 08:09


The lookahead itself is neither lazy nor greedy. It is only a check, and in your case it checks for a nearly fixed string.

What you need to make lazy is the subpattern before the lookahead:

^https?:\/\/(?:[^\/]+\/)*?(?=dinner(?:\/|$))

Note: (?:\/|$) acts like a boundary that ensures the word "dinner" is followed by a slash or the end of the string, so a host name such as dinnerdemo is skipped.
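As a rough check, the pattern can be run against the URLs from the question. This is a sketch in Python (an assumption, as the question does not say which engine is used), and prefix_before_dinner is a hypothetical helper name. The dinnerparty URL is an extra made-up case added to show the effect of the boundary.

```python
import re

# Same pattern as above; "/" needs no escaping in Python.
PATTERN = re.compile(r"^https?://(?:[^/]+/)*?(?=dinner(?:/|$))")

def prefix_before_dinner(url):
    """Return the part of url before the first 'dinner' path segment, or None."""
    m = PATTERN.match(url)
    return m.group(0) if m else None

urls = [
    "https://breakfast.example.com/lunch/dinner/",
    "https://breakfast.example.brunch.com:8080/lunch/dinner",
    "http://dinnerdemo.example.com/dinner/",
    "https://breakfast.example.com/lunch/dinner/login.html"
    "?returnURL=https://breakfast.example.com/lunch/dinner/",
    "https://example.com/dinnerparty/",  # made-up case: no exact "dinner" segment
]

results = [prefix_before_dinner(u) for u in urls]
for u, r in zip(urls, results):
    print(u, "->", r)
```

Because each repetition of (?:[^/]+/) consumes exactly one path segment, the lookahead is only ever tested at the start of a segment, which is why dinnerdemo in the host name and dinnerparty in the path are not mistaken for dinner.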

Casimir et Hippolyte answered Sep 29 '22 at 08:09