
Extract registered domain from URL based on Public Suffix List

Given a URL, how do I extract the registered domain using the Public Suffix List (the list of effective TLDs)?

For instance, considering a.bg is a valid public suffix:

http://www.test.start.a.bg/hello.html -> start.a.bg 
http://test.start.a.bg/               -> start.a.bg
http://test.start.abc.bg/             -> abc.bg (.bg is the public suffix)

This cannot be done using simple string manipulation because the public suffix can consist of multiple levels depending on the TLD.

P.S. It doesn't matter how I read the list (database or flat file), but the list should be accessible locally so I'm not always dependent on external services.

ilhan asked Nov 25 '11 18:11



3 Answers

You can use parse_url() to extract the hostname, then use the library provided by regdom to determine the registered domain name (domain name + eTLD). For example:

// Load the effective TLD data first, then the lookup functions
require_once("effectiveTLDs.inc.php");
require_once("regDomain.inc.php");

$url = 'http://www.metu.edu.tr/dhasjkdas/sadsdds/sdda/sdads.html';
// parse_url() with PHP_URL_HOST returns only the hostname part of the URL
echo getRegisteredDomain(parse_url($url, PHP_URL_HOST));

That will print out metu.edu.tr.

Other examples I've tried:

http://www.xyz.start.bg/hello   ->   start.bg
http://www.start.a.bg/world     ->   start.a.bg  (a.bg is a listed eTLD)
http://xyz.ma219.metu.edu.tr    ->   metu.edu.tr
http://www.google.com/search    ->   google.com
http://google.co.uk/search?asd  ->   google.co.uk

UPDATE: These libraries have been moved to: https://github.com/leth/registered-domains-php
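
To make the lookup less of a black box, here is a minimal sketch of the matching idea in plain PHP. It is not regdom's code: registeredDomainSketch() and the tiny hand-picked $rules set are hypothetical, and the sketch assumes the host is already lowercase ASCII (no IDN handling). It looks for the longest suffix of the host that matches a rule, honours wildcard ("*.") and exception ("!") rules in a simplified way, and returns that suffix plus one more label.

```php
<?php
// Hypothetical helper, not part of regdom. $rules is a set of Public Suffix
// List rules used as array keys (rule => any non-null value).
function registeredDomainSketch(string $host, array $rules): ?string
{
    $labels = explode('.', strtolower(trim($host, '.')));
    $count = count($labels);
    $suffixLen = 1; // fallback when no rule matches: the last label is the suffix

    // Try candidate suffixes from the longest to the shortest.
    for ($i = 0; $i < $count; $i++) {
        $candidate = implode('.', array_slice($labels, $i));
        if (isset($rules['!' . $candidate])) {
            // Exception rule: the suffix is the candidate minus its leftmost label.
            $suffixLen = $count - $i - 1;
            break;
        }
        $wildcard = '*.' . implode('.', array_slice($labels, $i + 1));
        if (isset($rules[$candidate]) || ($i + 1 < $count && isset($rules[$wildcard]))) {
            $suffixLen = $count - $i;
            break;
        }
    }

    // A host that is itself a public suffix has no registered domain.
    if ($count <= $suffixLen) {
        return null;
    }
    // Registered domain = public suffix plus one label to its left.
    return implode('.', array_slice($labels, $count - $suffixLen - 1));
}

// Hand-picked subset of rules for illustration only; a real lookup needs the full list.
$rules = array_flip(['bg', 'a.bg', 'com', 'uk', 'co.uk', 'tr', 'edu.tr', '*.ck', '!www.ck']);

$host = parse_url('http://www.test.start.a.bg/hello.html', PHP_URL_HOST);
echo registeredDomainSketch($host, $rules), "\n";                // start.a.bg
echo registeredDomainSketch('test.start.abc.bg', $rules), "\n";  // abc.bg
```

With a complete rule set loaded from a local copy of the list, the same function would cover the other examples above; a maintained library is still the safer choice for production use.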

Shawn Chin answered Nov 17 '22 18:11



This question is a bit old, but there's a new solution: https://github.com/jeremykendall/php-domain-parser

This library does exactly what you want. Here's the setup:

// Assumes the library's classes are already autoloaded (e.g. via Composer)
$pslManager = new Pdp\PublicSuffixListManager();
$parser = new Pdp\Parser($pslManager->getList());
echo $parser->getRegisterableDomain('www.scottwills.co.uk');

This will print "scottwills.co.uk".

Alex Grin answered Nov 17 '22 19:11



I recommend TLDExtract; it has a regularly updatable database that is generated from the PSL.

$extract = new LayerShifter\TLDExtract\Extract();

$result = $extract->parse('shop.github.com');
$result->getFullHost(); // will return (string) 'shop.github.com'
$result->getRegistrableDomain(); // will return (string) 'github.com'
$result->isValidDomain(); // will return (bool) true
$result->isIp(); // will return (bool) false
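
For anyone curious what "a database generated from the PSL" involves, here is a rough sketch of parsing a local copy of the list into a lookup set. It is not TLDExtract's code: parsePublicSuffixRules() is a hypothetical helper, and the file path in the comment is a placeholder. It relies only on the list's published format: one rule per line, comment lines starting with "//", and a rule ending at the first whitespace.

```php
<?php
// Hypothetical helper, not part of TLDExtract: turns the text of a Public
// Suffix List file into a lookup set (rule => true).
function parsePublicSuffixRules(string $text): array
{
    $rules = [];
    foreach (preg_split('/\R/', $text) as $line) {
        $line = trim($line);
        // Skip blank lines and comment lines, which start with "//".
        if ($line === '' || strpos($line, '//') === 0) {
            continue;
        }
        // A rule runs only up to the first whitespace on its line.
        $rule = preg_split('/\s+/', $line)[0];
        $rules[strtolower($rule)] = true;
    }
    return $rules;
}

// With a downloaded copy of the list (placeholder path), usage would look like:
// $rules = parsePublicSuffixRules(file_get_contents('/path/to/public_suffix_list.dat'));

// Small inline sample in the same format, so the sketch runs on its own.
$sample = "// ===BEGIN ICANN DOMAINS===\n\nbg\na.bg\n*.ck\n!www.ck\n";
$rules = parsePublicSuffixRules($sample);
var_dump(isset($rules['a.bg'])); // bool(true)
```

Keeping the parsed set in a local file or cache means lookups never depend on an external service at request time; only the periodic refresh of the list needs network access.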
Oleksandr Fediashov answered Nov 17 '22 19:11
