
regex match main domain name

Tags: regex

I need to be able to identify the main domain name of any subdomain.

Examples:

For all of these I need to match only the main domain: example.co / example.com / example.org / example.co.uk / example.com.au / example.gov.us and so on.

www.example.co
www.first.example.co
first.example.co
second.first.example.co
no.matter.how.many.example.co
first.example.co.uk
second.first.example.co.uk
no.matter.how.many.example.co.uk
first.example.org
second.first.example.org
no.matter.how.many.example.org
first.example.gov.uk
second.first.example.gov.uk
no.matter.how.many.example.gov.uk

I have been playing with regular expressions and Googling all day long and still can't seem to find anything that works.

Edit2: I would rather have a regex that fails for very odd cases like t.co than list all TLDs and have the ones I did not list, but which could have been predicted, fail and match more than they should. Isn't this the option you would choose?

Update: Using the chosen answer as a guide I have constructed this regex that does the job for me.

/([0-9a-z-]{2,}\.[0-9a-z-]{2,3}\.[0-9a-z-]{2,3}|[0-9a-z-]{2,}\.[0-9a-z-]{2,3})$/i

It might not be perfect but so far I have not encountered a case where it fails.
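For anyone who wants to try the pattern above outside a regex tester, here is a minimal Python sketch. The function name `main_domain` is made up for this example, and the `/i` flag is expressed as `re.IGNORECASE`; it is only a quick check, not a full validation.

```python
import re

# The pattern from the update above; the /i flag becomes re.IGNORECASE.
MAIN_DOMAIN = re.compile(
    r"([0-9a-z-]{2,}\.[0-9a-z-]{2,3}\.[0-9a-z-]{2,3}"
    r"|[0-9a-z-]{2,}\.[0-9a-z-]{2,3})$",
    re.IGNORECASE,
)


def main_domain(host):
    """Return the tail of `host` that the pattern matches, or None."""
    found = MAIN_DOMAIN.search(host)
    return found.group(1) if found else None


if __name__ == "__main__":
    samples = (
        "www.example.co",
        "no.matter.how.many.example.co.uk",
        "second.first.example.org",
        "first.example.gov.uk",
    )
    for host in samples:
        print(host, "->", main_domain(host))
```

Two inputs are worth checking before relying on it: `t.co` returns no match (the trade-off accepted in Edit2), and a host such as `www.abc.com`, where the name itself has only three characters, is matched whole by the first alternative.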

transilvlad asked Oct 07 '12 20:10


2 Answers

This will match:

([0-9A-Za-z]{2,}\.[0-9A-Za-z]{2,3}\.[0-9A-Za-z]{2,3}|[0-9A-Za-z]{2,}\.[0-9A-Za-z]{2,3})$

as long as:

  1. there are no extra spaces at the end of each line
  2. all domain codes used are short, two or three letters long. It will not work with long domain codes like .info.

Basically, what it does is match either of these two:

  1. word two letters or longer:dot:two or three letters word:dot:two or three letters word:end of line
  2. word two letters or longer:dot:two or three letters word:end of line
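The two alternatives can also be spelled out piece by piece. Below is a rough Python sketch of the same pattern written with `re.VERBOSE`; the group names `three_part` and `two_part` and the helper `which_branch` are labels invented for this illustration and are not part of the original regex.

```python
import re

# Same pattern as above, laid out with re.VERBOSE so each piece can carry a
# comment. The group names are labels added for this example only.
PATTERN = re.compile(
    r"""
    (?:
        (?P<three_part>
            [0-9A-Za-z]{2,}  \.    # word of two or more characters, then a dot
            [0-9A-Za-z]{2,3} \.    # word of two or three characters, then a dot
            [0-9A-Za-z]{2,3}       # word of two or three characters
        )
      |
        (?P<two_part>
            [0-9A-Za-z]{2,}  \.    # word of two or more characters, then a dot
            [0-9A-Za-z]{2,3}       # word of two or three characters
        )
    )
    $                              # end of line
    """,
    re.VERBOSE,
)


def which_branch(host):
    """Return (branch_name, matched_text), or None when nothing matches."""
    found = PATTERN.search(host)
    if found is None:
        return None
    return found.lastgroup, found.group(found.lastgroup)
```

With this, `which_branch("first.example.org")` reports the `two_part` branch, while a host ending in `.info` or in a trailing space gives `None`, which lines up with the two conditions listed earlier.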

Short version (note that \w also matches underscores):

(\w{2,}\.\w{2,3}\.\w{2,3}|\w{2,}\.\w{2,3})$

If you want it to only match whole lines, then add ^ at the beginning
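To see what adding ^ changes, here is a small Python sketch using the short version. The helper names `tail` and `is_bare` are hypothetical, chosen only for this example.

```python
import re

# The short version of the pattern. In Python 3, \w also accepts underscores
# and non-ASCII letters, so it is looser than [0-9A-Za-z].
SHORT = r"(\w{2,}\.\w{2,3}\.\w{2,3}|\w{2,}\.\w{2,3})$"

tail_only = re.compile(SHORT)         # matches the tail of a longer host name
whole_line = re.compile("^" + SHORT)  # the line must be the domain and nothing else


def tail(host):
    """Return the matched tail of `host`, or None."""
    found = tail_only.search(host)
    return found.group(1) if found else None


def is_bare(host):
    """Return True when the whole line matches, i.e. no extra subdomain labels."""
    return whole_line.search(host) is not None
```

So `tail("first.example.co.uk")` still gives `example.co.uk`, while `is_bare("first.example.co.uk")` is `False` and `is_bare("example.co.uk")` is `True`. In other words, the anchored form is for checking a line, not for extracting the domain from a longer name.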


Tulains Córdova answered Oct 04 '22 05:10


If you want an absolutely correct matcher, regular expressions are not the way to go.

Why?

  • Because both of these are valid domains + TLDs: goo.gl, t.co.

  • Because neither of these is (they're effectively only TLDs): com.au, co.uk.

Any regex that you might create that would properly handle all of the above cases would simply amount to listing out the valid TLDs, which would defeat the purpose of using regular expressions in the first place.

Instead, just create/obtain a list of the current TLDs and see which one of them is present, then add the first segment before it.
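A minimal Python sketch of that lookup, assuming you already have the list in memory. `SAMPLE_SUFFIXES` is a tiny hand-written sample for illustration only (a real list would be far longer and needs to be kept current), and `registered_domain` is a made-up name.

```python
# A tiny, hand-written sample of suffixes, for illustration only.
SAMPLE_SUFFIXES = {"co", "com", "org", "gl", "co.uk", "gov.uk", "com.au"}


def registered_domain(host, suffixes=SAMPLE_SUFFIXES):
    """Return the longest known suffix plus the one label before it, else None."""
    labels = host.lower().rstrip(".").split(".")
    # Walk from the longest candidate suffix to the shortest, so that
    # "co.uk" is found before a shorter suffix would be.
    for start in range(len(labels)):
        if ".".join(labels[start:]) in suffixes:
            # start == 0 means the host is itself only a suffix (e.g. "co.uk").
            return ".".join(labels[start - 1:]) if start > 0 else None
    return None


if __name__ == "__main__":
    for host in ("www.bbc.co.uk", "t.co", "goo.gl", "co.uk", "first.example.co"):
        print(host, "->", registered_domain(host))
```

With this sample list, `t.co` and `goo.gl` come back unchanged, `www.bbc.co.uk` gives `bbc.co.uk`, and `co.uk` on its own gives `None`. The correctness depends entirely on how complete the list is, which is the trade-off compared with the regex approach.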

Amber answered Oct 04 '22 07:10