
Perl Regex to get the root domain of a URL

Tags: regex, perl

How can I extract part of a URL?

For example:

http://www.facebook.com/xxxxxxxxxxx
http://www.stackoverflow.com/yyyyyyyyyyyyyyyy

I need to take just this part:

facebook.com
stackoverflow.com
Aleksandra Sretenovic asked Dec 09 '22 16:12


2 Answers

use feature qw( say state );

use Domain::PublicSuffix qw( );
use URI                  qw( );

# Returns "domain.tld" for "subdomain.domain.tld". 
# Handles multi-level TLDs such as ".co.uk".
sub root_domain {
   my ($domain) = @_;
   state $parser = Domain::PublicSuffix->new();
   return $parser->get_root_domain($domain);
}

# Accepts URLs as strings and as URI objects.
sub url_root_domain {
   my ($abs_url) = @_;
   my $domain = URI->new($abs_url)->host();
   return root_domain($domain);
}

say url_root_domain('http://www.facebook.com/');       # facebook.com
say url_root_domain('https://www.facebook.com/');      # facebook.com
say url_root_domain('http://mobile.google.com/');      # google.com
say url_root_domain('http://www.theregister.co.uk/');  # theregister.co.uk
say url_root_domain('http://www.com/');                # www.com
ikegami answered Dec 11 '22 10:12


I like the URI answer. The OP asked for a regex, though, so in honor of the request and as a challenge, here is what I came up with. To be fair, it is not always easy or feasible to install CPAN modules. I have worked on projects that were locked down to a very specific version of Perl, where only certain modules were allowed.

Here is my attempt at the regex answer. Note that the www. prefix is optional. Subdomains such as mobile. are kept. The host capture stops at the first : or /, so a URL with a port number or with directories on the end is parsed correctly. The pattern does not depend on the protocol; it could be http, https, sftp, or anything else. The output is captured in $1.

^.*://(?:[wW]{3}\.)?([^:/]*).*$

Sample input:

http://WWW.facebook.com:80/
http://facebook.com/xxxxxxxxxxx/aaaaa
http://www.stackoverflow.com/yyyyyyyyyyyyyyyy/aaaaaaa
https://mobile.yahoo.com/yyyyyyyyyyyyyyyy/aaaaaaa
http://www.theregister.co.uk/

Sample output:

facebook.com
facebook.com
stackoverflow.com
mobile.yahoo.com
theregister.co.uk
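
For anyone who wants to reproduce the sample run, here is a minimal sketch using only core Perl. It is the same pattern, written with the /x modifier so each piece can carry a comment. The helper name host_without_www and the variable names are my own choices for illustration, not part of the original pattern.

```perl
use strict;
use warnings;

# Same pattern as above, spread out with /x so each part can be annotated.
my $host_re = qr{
    ^ .* ://            # everything up to the scheme separator
    (?: [wW]{3} \. )?   # optional www. prefix in any letter case
    ( [^:/]* )          # capture group 1: host, up to the first colon or slash
    .*                  # rest of the URL (port, path, query), ignored
    $
}x;

# Hypothetical helper: returns the captured host, or undef when nothing matches.
sub host_without_www {
    my ($url) = @_;
    return $url =~ $host_re ? $1 : undef;
}

my @urls = qw(
    http://WWW.facebook.com:80/
    http://facebook.com/xxxxxxxxxxx/aaaaa
    http://www.stackoverflow.com/yyyyyyyyyyyyyyyy/aaaaaaa
    https://mobile.yahoo.com/yyyyyyyyyyyyyyyy/aaaaaaa
    http://www.theregister.co.uk/
);

# Print one host per input URL.
print host_without_www($_), "\n" for @urls;
```

One caveat with this sketch: the leading .* is greedy, so if a URL contains a second :// further along (for example inside a query string), the match starts from the last one rather than the first.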

EDIT: Thanks @ikegami for the extra challenge. :) Now it supports WWW in any mixed case and a port number like :80.

Jess answered Dec 11 '22 11:12