Regex: Capturing first occurrence before lookahead

Question

I'm trying to capture the part of each URL that comes before a particular word. The only trouble is that the word could also be part of the domain.

Examples (I'm trying to capture everything before dinner):

https://breakfast.example.com/lunch/dinner/

https://breakfast.example.brunch.com:8080/lunch/dinner

http://dinnerdemo.example.com/dinner/

I am able to use:

^(.*://.*/)(?=dinner/?)

The trouble I am having is that the lookahead doesn't appear to be lazy enough, so the following is failing:

https://breakfast.example.com/lunch/dinner/login.html?returnURL=https://breakfast.example.com/lunch/dinner/

as it captures:

https://breakfast.example.com/lunch/dinner/login.html?returnURL=https://breakfast.example.com/lunch/

I don't understand why this happens or how to fix my regex. Perhaps I'm on the wrong track, but how can I capture all my examples?

Ray Toal · Accepted Answer

You can use some laziness:

^(.*?:\/\/).*?/(?=dinner/?)

Because .* is greedy, the first .* in your regex ate everything up to the last :// in the string, and the match was found from there.

Using .* in the middle of a regex, by the way, is very bad practice. It can cause horrendous backtracking and degrade performance on long strings. .*? is better, since it is reluctant rather than greedy.
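The difference between the greedy pattern from the question and the lazy pattern in this answer can be checked with a short Python sketch. Python's re module is an assumption here, since the question does not name a regex flavor; the variable names are illustrative only.

```python
import re

# The failing URL from the question.
url = ("https://breakfast.example.com/lunch/dinner/login.html"
       "?returnURL=https://breakfast.example.com/lunch/dinner/")

# Pattern from the question: both .* are greedy.
greedy = re.compile(r"^(.*://.*/)(?=dinner/?)")

# Pattern from this answer: both quantifiers are lazy (.*?).
lazy = re.compile(r"^(.*?:\/\/).*?/(?=dinner/?)")

# group(0) is the whole match, i.e. everything before "dinner".
greedy_match = greedy.match(url).group(0)
lazy_match = lazy.match(url).group(0)

print(greedy_match)  # runs up to the last "/dinner" in the string
print(lazy_match)    # stops at the first "/dinner"
```

Note that in the lazy pattern the capture group holds only the scheme (https://), so the text before dinner is the whole match, group(0), rather than group(1).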

Casimir et Hippolyte · Answer

A lookahead is neither lazy nor greedy. It is only a check, and in your case it checks for a nearly fixed string.

What you need to make lazy is obviously the subpattern before the lookahead.

^https?:\/\/(?:[^\/]+\/)*?(?=dinner(?:\/|$))

Note: (?:\/|$) acts like a boundary that ensures the word "dinner" is followed by a slash or the end of the string.
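As a rough check, the pattern above can be run against the URLs from the question. This is a minimal sketch assuming Python's re module; the helper name prefix_before_dinner and the last URL (with a dinnertime path segment) are hypothetical, added only to show what the boundary does.

```python
import re

# Pattern from this answer, copied verbatim.
pattern = re.compile(r"^https?:\/\/(?:[^\/]+\/)*?(?=dinner(?:\/|$))")

def prefix_before_dinner(url):
    """Return the part of url before the first 'dinner' path segment, or None."""
    m = pattern.match(url)
    return m.group(0) if m else None

urls = [
    "https://breakfast.example.com/lunch/dinner/",
    "https://breakfast.example.brunch.com:8080/lunch/dinner",
    "http://dinnerdemo.example.com/dinner/",
    "https://breakfast.example.com/lunch/dinner/login.html"
    "?returnURL=https://breakfast.example.com/lunch/dinner/",
    # Hypothetical URL (not from the question): "dinnertime" starts with
    # "dinner" but is not followed by "/" or end of string, so it is skipped.
    "https://example.com/dinnertime/dinner/",
]

results = [prefix_before_dinner(u) for u in urls]
for u, r in zip(urls, results):
    print(u, "->", r)
```

Because the subpattern walks the path one segment at a time with [^\/]+\/, the lookahead is only tested at the start of a segment, which is why dinnerdemo in the host name does not produce a false match.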

Tags:

regex

Brandon
