I have a link such as http://www.techcrunch.com/ and I would like to get just the techcrunch.com part of it. How do I go about this in Python?
While the term "root domain" originated in the context of DNS (the Domain Name System), it typically refers to the combination of a unique domain name and a top-level domain (extension) that together form a complete "website address." Your website's root domain is the highest page in your site hierarchy (probably your ...
Getting the hostname is easy enough using urlparse:
from urllib.parse import urlparse  # Python 3; on Python 2 this lives in the urlparse module
hostname = urlparse("http://www.techcrunch.com/").hostname
Getting the "root domain", however, is going to be more problematic, because it isn't defined in a syntactic sense. What's the root domain of "www.theregister.co.uk"? How about networks using default domains? "devbox12" could be a valid hostname.
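To make the ambiguity concrete, here is a minimal sketch of the naive approach: keep the last two dot-separated labels of the hostname. The helper name naive_root_domain is made up for illustration and is not part of any library.

```python
from urllib.parse import urlparse


def naive_root_domain(url):
    """Keep the last two labels of the hostname (hypothetical helper)."""
    hostname = urlparse(url).hostname
    if hostname is None:
        return None
    labels = hostname.split(".")
    # Slicing copes with single-label hostnames such as "devbox12".
    return ".".join(labels[-2:])


print(naive_root_domain("http://www.techcrunch.com/"))     # techcrunch.com
print(naive_root_domain("http://www.theregister.co.uk/"))  # co.uk
print(naive_root_domain("http://devbox12/"))               # devbox12
```

This works for techcrunch.com but returns "co.uk" for The Register, which is why a purely syntactic rule is not enough.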
One way to handle this would be to use the Public Suffix List, which attempts to catalogue both real top-level domains (e.g. ".com", ".net", ".org") and other suffixes that are used like TLDs (e.g. ".co.uk" or even ".github.io"). You can access the PSL from Python using the publicsuffix2 library:
import publicsuffix2
from urllib.parse import urlparse

def get_base_domain(url):
    # This causes an HTTP request; if your script is running more than,
    # say, once a day, you'd want to cache it yourself. Make sure you
    # update frequently, though!
    psl_file = publicsuffix2.fetch()

    hostname = urlparse(url).hostname

    return publicsuffix2.get_public_suffix(hostname, psl_file)
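If it helps to see what the lookup is doing, the idea can be sketched offline in a few lines: find the longest known suffix that the hostname ends with, then keep one more label to its left. Everything here is an assumption for illustration: SAMPLE_SUFFIXES is a tiny hand-picked subset, base_domain is a made-up name, and the real list also has wildcard and exception rules that this sketch ignores.

```python
from urllib.parse import urlparse

# Hypothetical, hand-picked subset for illustration only; the real
# Public Suffix List has thousands of entries.
SAMPLE_SUFFIXES = {"com", "net", "org", "uk", "co.uk", "io", "github.io"}


def base_domain(url, suffixes=SAMPLE_SUFFIXES):
    """Return the longest matching suffix plus one label, or None."""
    hostname = urlparse(url).hostname
    if hostname is None:
        return None
    labels = hostname.lower().split(".")
    # Candidates run from longest to shortest, so the first hit is
    # the longest suffix in the set.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in suffixes:
            if i == 0:
                return None  # the hostname is itself a public suffix
            return ".".join(labels[i - 1:])
    return None  # no known suffix, e.g. a bare intranet hostname


print(base_domain("http://www.techcrunch.com/"))     # techcrunch.com
print(base_domain("http://www.theregister.co.uk/"))  # theregister.co.uk
print(base_domain("http://devbox12/"))               # None
```

For real use, prefer the library and the full list; this only shows why "www.theregister.co.uk" resolves to "theregister.co.uk" rather than "co.uk".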