Get Root Domain of Link

1 Answer

Getting the hostname is easy enough using urlparse:

from urllib.parse import urlparse  # Python 2: from urlparse import urlparse

hostname = urlparse("http://www.techcrunch.com/").hostname  # "www.techcrunch.com"

Getting the "root domain", however, is going to be more problematic, because it isn't defined in a syntactic sense. What's the root domain of "www.theregister.co.uk"? How about networks using default domains? "devbox12" could be a valid hostname.
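To make the problem concrete, here is a minimal sketch of the obvious syntactic rule, "keep the last two labels". The helper name `naive_root` is hypothetical and only serves to show where that rule breaks down on the hostnames mentioned above.

```python
from urllib.parse import urlparse


def naive_root(url):
    """Hypothetical helper: keep only the last two dot-separated labels."""
    hostname = urlparse(url).hostname
    return ".".join(hostname.split(".")[-2:])


print(naive_root("http://www.techcrunch.com/"))     # techcrunch.com
print(naive_root("http://www.theregister.co.uk/"))  # co.uk (a suffix, not a site)
print(naive_root("http://devbox12/"))               # devbox12
```

The first result looks right, but the second returns a suffix shared by many unrelated sites, and the third has no dots to split on at all.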

One way to handle this would be to use the Public Suffix List, which attempts to catalogue both real top-level domains (e.g. ".com", ".net", ".org") and domains which are used like TLDs (e.g. ".co.uk" or even the privately operated ".github.io"). You can access the PSL from Python using the publicsuffix2 library:

import publicsuffix2  # the publicsuffix2 package installs a module of the same name
from urllib.parse import urlparse  # Python 2: from urlparse import urlparse

def get_base_domain(url):
    # This causes an HTTP request; if your script is running more than,
    # say, once a day, you'd want to cache it yourself.  Make sure you
    # update frequently, though!
    psl = publicsuffix2.fetch()

    hostname = urlparse(url).hostname

    return publicsuffix2.get_public_suffix(hostname, psl)
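For intuition about what a suffix-list lookup does, the following is a rough offline sketch, not the library's actual implementation. `SAMPLE_SUFFIXES` and `registrable_domain` are hypothetical names, and the suffix set is a tiny hand-picked assumption; the real list has thousands of entries plus wildcard and exception rules that this sketch ignores.

```python
from urllib.parse import urlparse

# Assumption: a tiny hand-picked subset of suffixes, for illustration only.
SAMPLE_SUFFIXES = {"com", "net", "org", "uk", "co.uk", "io", "github.io"}


def registrable_domain(url, suffixes=SAMPLE_SUFFIXES):
    """Return the longest known suffix plus one more label, or None."""
    hostname = urlparse(url).hostname
    if not hostname:
        return None
    labels = hostname.lower().split(".")
    # Try the longest candidate first: for a.b.co.uk that is "a.b.co.uk",
    # then "b.co.uk", then "co.uk", then "uk".
    for i in range(len(labels)):
        if ".".join(labels[i:]) in suffixes:
            # i == 0 means the hostname is itself a suffix, so there is
            # no registrable domain in front of it.
            return ".".join(labels[i - 1:]) if i > 0 else None
    # Simplification: unknown suffixes (e.g. "devbox12") yield None here.
    return None


print(registrable_domain("http://www.theregister.co.uk/"))  # theregister.co.uk
print(registrable_domain("http://www.techcrunch.com/"))     # techcrunch.com
print(registrable_domain("http://devbox12/"))               # None
```

Matching the longest suffix first is what keeps "co.uk" from being shadowed by the shorter "uk" entry. For real use, prefer the maintained list over a hard-coded set like this one.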

125

answered Sep 30 '22 03:09

Ben Blank


Gavin Schulz
