Going where PHP parse_url() doesn't - Parsing only the domain

Tags: php, dns

PHP's parse_url() returns a host field containing the full hostname. I'm looking for the most reliable (and least costly) way to return only the domain and TLD.

Given the examples:

  • For http://www.google.com/foo, parse_url() returns www.google.com as the host
  • For http://www.google.co.uk/foo, parse_url() returns www.google.co.uk as the host

I am looking for only google.com or google.co.uk. I have contemplated a table of valid TLDs/suffixes, allowing only those plus one label to their left. Would you do it any other way? Does anyone know of a pre-canned regex for this sort of thing?
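
Something like this is what I have in mind: a hypothetical, untested sketch with a deliberately incomplete suffix table:

$suffixes = array('co.uk', 'org.uk', 'com', 'net', 'org'); // deliberately incomplete

function domainFromHost($host, array $suffixes) {
  // Check longer suffixes first so 'co.uk' would win over a plain 'uk' entry.
  foreach ($suffixes as $suffix) {
    if (preg_match('/([^.]+\.' . preg_quote($suffix, '/') . ')$/i', $host, $m)) {
      return $m[1]; // exactly one label plus the matched suffix
    }
  }
  return false;
}

echo domainFromHost('www.google.co.uk', $suffixes); // google.co.uk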

asked Dec 30 '08 by Gavin M. Roy


3 Answers

There is also a very nice PHP port of Python's tldextract module (http://w-shadow.com/blog/2012/08/28/tldextract). It goes beyond parse_url() and lets you extract the actual domain and TLD, without the subdomain.

From the module website:

$components = tldextract('http://www.bbc.co.uk');
echo $components->subdomain; // www
echo $components->domain;    // bbc
echo $components->tld;       // co.uk
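
If you want the single registrable-domain string the question asks for, you can glue the pieces back together. Here is a small hypothetical wrapper around the tldextract() call above; it assumes the components come back as empty strings when a part is missing:

function registrableDomain($url) {
  $components = tldextract($url);
  // Assumption: domain/tld are empty strings for inputs such as
  // bare hostnames or IP addresses.
  if ($components->domain === '' || $components->tld === '') {
    return false;
  }
  return $components->domain . '.' . $components->tld;
}

echo registrableDomain('http://www.bbc.co.uk'); // bbc.co.uk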
answered Nov 02 '22 by Martin B.


How about something like this?

function getDomain($url) {
  $pieces = parse_url($url);
  $domain = isset($pieces['host']) ? $pieces['host'] : '';
  // Keep only the last label plus a short suffix of 2-6 letters/dots
  // (e.g. "google.com" or "google.co.uk"), dropping any subdomains.
  if (preg_match('/(?P<domain>[a-z0-9][a-z0-9\-]{1,63}\.[a-z\.]{2,6})$/i', $domain, $regs)) {
    return $regs['domain'];
  }
  return false;
}

This extracts the host with the classic parse_url() and then looks for a valid domain without any subdomain (www counts as a subdomain). It won't work on hosts like 'localhost', and it returns false if nothing matches.

// Edit:

Try it out with:

echo getDomain('http://www.google.com/test.html') . '<br/>';
echo getDomain('https://news.google.co.uk/?id=12345') . '<br/>';
echo getDomain('http://my.subdomain.google.com/directory1/page.php?id=abc') . '<br/>';
echo getDomain('https://testing.multiple.subdomain.google.co.uk/') . '<br/>';
echo getDomain('http://nothingelsethan.com') . '<br/>';

And it should return:

google.com
google.co.uk
google.com
google.co.uk
nothingelsethan.com

Of course, it won't return anything if the URL doesn't get through parse_url(), so make sure it's a well-formed URL.

// Addendum:

Alnitak is right. The solution presented above will work in most cases but not necessarily all, and it needs maintenance to make sure, for example, that no new TLDs exceed the six-character limit in the regex, and so on. The only reliable way of extracting the domain is to use a maintained list such as http://publicsuffix.org/. It's more painful at first but easier and more robust in the long term. You need to make sure you understand the pros and cons of each method and how they fit with your project.
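
For illustration, here is a minimal sketch of the list-based lookup, assuming you have downloaded https://publicsuffix.org/list/public_suffix_list.dat to a local file. It skips the list's wildcard (*) and exception (!) rules, so treat it as illustrative rather than production-ready:

function getDomainFromList($host, $suffixFile = 'public_suffix_list.dat') {
  // Load the plain suffix rules, skipping comments and the
  // wildcard/exception rules we are ignoring for brevity.
  $suffixes = array();
  foreach (file($suffixFile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
    if (strpos($line, '//') === 0 || $line[0] === '*' || $line[0] === '!') {
      continue;
    }
    $suffixes[strtolower($line)] = true;
  }

  // Walk from the full host toward shorter candidates; the first hit is
  // the longest public suffix, and the registrable domain is that suffix
  // plus one more label.
  $labels = explode('.', strtolower($host));
  for ($i = 0; $i < count($labels); $i++) {
    $candidate = implode('.', array_slice($labels, $i));
    if (isset($suffixes[$candidate])) {
      return $i > 0 ? $labels[$i - 1] . '.' . $candidate : false;
    }
  }
  return false;
}

echo getDomainFromList('www.google.co.uk'); // google.co.uk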

answered Nov 02 '22 by lpfavreau


Currently the only "right" way to do this is to use a list such as the one maintained at http://publicsuffix.org/.

BTW, this question is also pretty much a duplicate of:

  • Can I improve this regex check for valid domain names?
  • Get the subdomain from a URL

There are standardisation efforts at the IETF looking at DNS methods of declaring whether a particular node in the DNS tree is used for "public" registrations, but they are still in the early stages of development. All of the popular non-IE browsers use the publicsuffix.org list.

answered Nov 01 '22 by Alnitak