
Get Root Domain of Link

I have a link such as http://www.techcrunch.com/ and I would like to get just the techcrunch.com part of it. How do I go about this in Python?

Gavin Schulz asked Oct 05 '09 18:10



1 Answer

Getting the hostname is easy enough using urlparse:

    from urllib.parse import urlparse  # Python 3; in Python 2: from urlparse import urlparse

    hostname = urlparse("http://www.techcrunch.com/").hostname
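Two behaviours of `hostname` are worth knowing before building on it. The snippet below is a small illustration, assuming Python 3's `urllib.parse`; the example URLs are made up for the purpose.

```python
from urllib.parse import urlparse

# hostname is lowercased and excludes the port.
print(urlparse("http://www.TechCrunch.com:8080/feed").hostname)

# Without a scheme (or a leading "//") the text is parsed as a path,
# so there is no network location and hostname is None.
print(urlparse("www.techcrunch.com").hostname)
```

If your input may lack a scheme, check for `None` before doing anything else with the result.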

Getting the "root domain", however, is going to be more problematic, because it isn't defined in a syntactic sense. What's the root domain of "www.theregister.co.uk"? How about networks using default domains? "devbox12" could be a valid hostname.
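To make the ambiguity concrete, here is a minimal sketch of the tempting approach of keeping only the last two dot-separated labels. The helper name `naive_root_domain` is hypothetical, and the code assumes Python 3.

```python
from urllib.parse import urlparse


def naive_root_domain(url):
    """Keep only the last two labels of the hostname (hypothetical helper)."""
    hostname = urlparse(url).hostname or ""
    labels = hostname.split(".")
    return ".".join(labels[-2:])


print(naive_root_domain("http://www.techcrunch.com/"))     # techcrunch.com
print(naive_root_domain("http://www.theregister.co.uk/"))  # co.uk
print(naive_root_domain("http://devbox12/"))               # devbox12
```

The first result is what you want, but the second is a suffix shared by many unrelated sites, which is why a purely syntactic rule is not enough.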

One way to handle this would be to use the Public Suffix List, which attempts to catalogue both real top-level domains (e.g. ".com", ".net", ".org") and other suffixes under which names are registered as if they were TLDs (e.g. ".co.uk" or even the privately operated ".github.io"). You can access the PSL from Python using the publicsuffix2 library:

    import publicsuffix2
    from urllib.parse import urlparse  # Python 3; the module was `urlparse` in Python 2

    def get_base_domain(url):
        # This causes an HTTP request; if your script is running more than,
        # say, once a day, you'd want to cache it yourself.  Make sure you
        # update frequently, though!
        psl = publicsuffix2.fetch()

        hostname = urlparse(url).hostname

        return publicsuffix2.get_public_suffix(hostname, psl)
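If it helps to see what a suffix lookup does, the idea can be sketched offline with a tiny hard-coded set. This is only an illustration under stated assumptions: `SAMPLE_SUFFIXES` is a hand-picked subset rather than the real list, `registrable_domain` is a hypothetical helper, and the wildcard and exception rules of the real Public Suffix List are ignored.

```python
from urllib.parse import urlparse

# Hand-picked subset for illustration only; the real Public Suffix List is
# far larger and also contains wildcard and exception rules ignored here.
SAMPLE_SUFFIXES = {"com", "net", "org", "uk", "co.uk", "io", "github.io"}


def registrable_domain(url, suffixes=SAMPLE_SUFFIXES):
    """Return the longest matching suffix plus one label, or None."""
    hostname = (urlparse(url).hostname or "").lower()
    if hostname in suffixes:
        # The hostname is itself a suffix, so nothing is registered under it.
        return None
    labels = hostname.split(".")
    # Try candidate suffixes from longest to shortest, always leaving at
    # least one label in front of the suffix.
    for i in range(1, len(labels)):
        if ".".join(labels[i:]) in suffixes:
            return ".".join(labels[i - 1:])
    return None


print(registrable_domain("http://www.techcrunch.com/"))     # techcrunch.com
print(registrable_domain("http://www.theregister.co.uk/"))  # theregister.co.uk
print(registrable_domain("http://devbox12/"))               # None
```

Returning `None` for a bare hostname such as "devbox12" is a design choice of this sketch; a real library may handle that case differently.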
Ben Blank answered Sep 30 '22 03:09