I have a URL which can be any of the following formats:
http://example.com https://example.com http://example.com/foo http://example.com/foo/bar www.example.com example.com foo.example.com www.foo.example.com foo.bar.example.com http://foo.bar.example.com/foo/bar example.net/foo/bar
Essentially, I need to be able to match any normal URL. How can I extract example.com
(or .net, whatever the tld happens to be. I need this to work with any TLD.) from all of these via a single regex?
Well you can use parse_url
to get the host:
$info = parse_url($url); $host = $info['host'];
Then, you can do some fancy stuff to get only the TLD and the Host
$host_names = explode(".", $host); $bottom_host_name = $host_names[count($host_names)-2] . "." . $host_names[count($host_names)-1];
Not very elegant, but should work.
If you want an explanation, here it goes:
First we grab everything between the scheme (http://
, etc), by using parse_url
's capabilities to... well.... parse URL's. :)
Then we take the host name, and separate it into an array based on where the periods fall, so test.world.hello.myname
would become:
array("test", "world", "hello", "myname");
After that, we take the number of elements in the array (4).
Then, we subtract 2 from it to get the second to last string (the hostname, or example
, in your example)
Then, we subtract 1 from it to get the last string (because array keys start at 0), also known as the TLD
Then we combine those two parts with a period, and you have your base host name.
My solution in https://gist.github.com/pocesar/5366899
and the tests are here http://codepad.viper-7.com/GAh1tP
It works with any TLD, and hideous subdomain patterns (up to 3 subdomains).
There's a test included with many domain names.
Won't paste the function here because of the weird indentation for code in StackOverflow (could have fenced code blocks like github)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With