I'm working on a regex for validating URLs in C#. The regex must not match any http:// other than the first one inside the URL.
This was my first try:
(https?:\/\/.+?)\/(.+?)(?!https?:\/\/)
But this regex does not work (even after removing (?!https?:\/\/)). Take this input string, for example:
http://test.test/notwork.http://test
Here is my first doubt: why doesn't the capturing group (.+?) match notwork.http://test? The lazy quantifier should match as few times as possible, but why not until the end? I was clearly missing something (at first I thought it could be related to backtracking, but I don't think that is the case), so I read this and found a solution, even if I'm not sure it is the best one, since it says that
This technique presents no advantage over the lazy dot-star
Anyway, that solution is the tempered dot. This is my next try:
(https?:\/\/.+?)\/((?:(?!https?:\/\/).)*)
Now: this regex is working but not in the way I would like. I need a match only when the url is valid.
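To make "working but not in the way I would like" concrete, here is a minimal sketch. It assumes Python's re module as a stand-in for .NET's Regex (the constructs used here behave the same in both); the names TEMPERED and tempered_match are made up for illustration.

```python
import re

# Same pattern as above. Python's re stands in for .NET Regex (assumption).
TEMPERED = re.compile(r"(https?:\/\/.+?)\/((?:(?!https?:\/\/).)*)")

def tempered_match(text):
    """Return (whole match, group 1, group 2), or None if nothing matches."""
    m = TEMPERED.search(text)
    return (m.group(0), m.group(1), m.group(2)) if m else None

# The pattern does not reject the input; it just stops before the second http://
print(tempered_match("http://test.test/notwork.http://test"))
# → ('http://test.test/notwork.', 'http://test.test', 'notwork.')
```

So the tempered dot limits how far the match extends, but a partial match is still reported for an invalid URL.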
By the way, I don't think I have fully understood what the new regex is doing: why does the negative lookahead come before the . and not after it?
So I tried moving it after the . and it seems to match the URL only up to the second-to-last character before the second http. Returning to the corrected regex, my hypothesis is that the negative lookahead is actually checking what comes after the . already read by the regex. Is this right?
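The two placements can be compared side by side. This is only a sketch, again assuming Python's re module in place of .NET's Regex; BEFORE, AFTER and consumed are hypothetical names used for illustration.

```python
import re

BEFORE = re.compile(r"(?:(?!https?://).)*")  # test the position, then consume a char
AFTER = re.compile(r"(?:.(?!https?://))*")   # consume a char, then test what follows

def consumed(pattern, text):
    """Return what the repeated group consumes from the start of text."""
    return pattern.match(text).group(0)

print(consumed(BEFORE, "notwork.http://test"))  # → notwork.
print(consumed(AFTER, "notwork.http://test"))   # → notwork
print(consumed(BEFORE, "http://x"))             # → (empty string)
print(consumed(AFTER, "http://x"))              # → http://x
```

With the lookahead first, each position is tested before its character is consumed, so the run stops exactly at the second http://. With the lookahead last, the character is consumed first and the test applies to what follows it, so the run stops one character early and never tests its own starting position.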
Other solutions are welcome, but I'd first like to understand this one. Thank you.
The solution you seek is
(?>https?://\S+?/(?:(?!https?://).)*)(?!https?://)
See the regex demo
Details
(?>https?://\S+?/(?:(?!https?://).)*) - an atomic group (allowing no backtracking into its subpatterns) that matches:
    https?:// - http:// or https://
    \S+? - one or more non-whitespace chars, as few as possible, up to the first...
    / - a / symbol, followed by...
    (?:(?!https?://).)* - zero or more chars (as many as possible), none of which starts an http:// or https:// sequence.
(?!https?://) - a negative lookahead that fails the match if there is http:// or https:// immediately to the right of the current location.

The (https?:\/\/.+?)\/(.+?)(?!https?:\/\/)
does not work because the .+? pattern matches lazily: it grabs the first char it finds, then lets the subsequent subpattern try to match. That subpattern is a negative lookahead, which fails the match only if there is http:// or https:// immediately to the right of the current location. As there is no such substring after n in http://test.test/notwork.http://test, the match succeeds and ends with n. If you do not tell the regex engine to match more, or to match up to some other delimiter or pattern, it won't.
The tempered greedy token solution has been discussed at length. Your exact doubt about where to place the lookahead is covered in this answer.