
Noncapturing along with capturing match

Tags: c#, regex

I am trying to capture the subdomain from huge lists of domain names. For example, I want to capture "funstuff" from "funstuff.mysite.com". I do not want ".mysite.com" included in the match. These occurrences are in a sea of text, so I cannot depend on them being at the start of a line. I know the subdomain will not include any special characters or numbers. So what I have is:

[a-z]{2,10}(?=\.mysite\.com)

The problem is this will work only if the subdomain is NOT preceded by a number or special character. For example, "asdfbasdasdfdfunstuff.mysite.com" will return "fdfunstuff" but "asdfasf23/funstuff.mysite.com" won't make a match.

I cannot depend on there being a special character before the subdomain, like the "/" in "http://funstuff.mysite.com", so that cannot be used as part of the condition.

It is OK if the capture gets erroneous text before the subdomain, although 99% of the time it will be preceded by something other than a lowercase letter. I have tried

(?<=[^a-z])[a-z]{2,10}(?=\.mysite\.com)

but for some reason this does not capture text in a situation like:

afb"asdfunstuff.mysite.com

Where the quotation mark prevents a match for [a-z]{2-20}. Basically what I would want to do in that case would be to capture asdfunstuff.mysite.com. How can this be accomplished?

rune711 asked Feb 20 '26 11:02


1 Answer

So you've got two problems to solve: first, you want to match ".mysite.com" but not capture it; second, you want to grab up to 10 alphabetic characters in the "subdomain" position.

The first problem can be solved by using a capturing group. The regex

([a-z]{2,10})\.mysite\.com

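will capture between 2 and 10 characters, and the returned match object will expose the captured text in one of its properties (which one depends on the language). In C#, Regex.Match returns a Match object whose Groups collection holds the captures: Groups[0] is the entire match and Groups[1] is the first capturing group, here the subdomain.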
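As a minimal sketch of reading that captured group in C#, assuming a console program with top-level statements (C# 9 or later), the following prints the whole match and then the captured subdomain. The sample input string is made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Pattern from the answer: the parentheses capture only the subdomain.
var pattern = new Regex(@"([a-z]{2,10})\.mysite\.com");

// Hypothetical sample text, not taken from the question.
Match m = pattern.Match("see http://funstuff.mysite.com/page");

if (m.Success)
{
    Console.WriteLine(m.Groups[0].Value); // whole match: funstuff.mysite.com
    Console.WriteLine(m.Groups[1].Value); // first capturing group: funstuff
}
```

Group 0 always holds the full match, so ".mysite.com" is still matched; it is simply left out of group 1.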

The second problem can be solved by using the word-boundary anchor \b. In .NET, this matches at a position where a word character (\w: a letter, digit, or underscore) is next to a non-word character (\W) or the start or end of the input. Other languages (e.g. ECMAScript / JavaScript) work similarly.

So, I suggest the following regex to solve your problem:

\b([a-z]{2,10})\.mysite\.com
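To see how the suggested pattern behaves, here is a small sketch that runs it over three inputs quoted in the question. FindSubdomains is a hypothetical helper name introduced only for this example, and the program assumes C# 9 top-level statements.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

var pattern = new Regex(@"\b([a-z]{2,10})\.mysite\.com");

// Hypothetical helper: collects group 1 (the subdomain) from every match.
List<string> FindSubdomains(string input)
{
    var found = new List<string>();
    foreach (Match m in pattern.Matches(input))
    {
        found.Add(m.Groups[1].Value);
    }
    return found;
}

// Inputs quoted in the question.
string[] samples =
{
    "http://funstuff.mysite.com",
    "asdfasf23/funstuff.mysite.com",
    "afb\"asdfunstuff.mysite.com",
};

foreach (string sample in samples)
{
    string joined = string.Join(", ", FindSubdomains(sample));
    Console.WriteLine(sample + " -> [" + joined + "]");
}
```

The first two inputs yield "funstuff". The third yields nothing: "asdfunstuff" is 11 letters long, so {2,10} cannot span from the word boundary to the dot. If such longer runs should be captured, the upper bound would need to be raised (for example {2,20}). Note also that \b treats digits as word characters, so an input like "23funstuff.mysite.com" has no boundary before the "f" and would not match.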

Note that numbers are legal in subdomain names, too, so the following might be generally correct (though perhaps not in your specific case):

\b(\w{2,10})\.mysite\.com

where the "word character" \w is equivalent to [a-zA-Z_0-9] in .NET's ECMAScript-compliant mode (RegexOptions.ECMAScript). In .NET's default mode, \w also matches Unicode letters and digits.
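A short sketch of the \w variant, assuming C# 9 top-level statements; the host name "fun2stuff" is a made-up example. It also contrasts how \w treats a non-ASCII letter with and without RegexOptions.ECMAScript.

```csharp
using System;
using System.Text.RegularExpressions;

string pattern = @"\b(\w{2,10})\.mysite\.com";

// Hypothetical host name containing a digit; \w accepts it in either mode.
Match m = Regex.Match("fun2stuff.mysite.com", pattern, RegexOptions.ECMAScript);
Console.WriteLine(m.Groups[1].Value); // fun2stuff

// U+00E9 (e with acute accent) is a word character only in the default mode.
bool defaultMode = Regex.IsMatch("\u00E9", @"\w");
bool ecmaMode = Regex.IsMatch("\u00E9", @"\w", RegexOptions.ECMAScript);
Console.WriteLine(defaultMode); // True
Console.WriteLine(ecmaMode);    // False
```

Passing RegexOptions.ECMAScript is one way to keep \w limited to ASCII letters, digits, and the underscore; writing the character class out explicitly is another.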

Jeremy answered Feb 21 '26 23:02


