Need a way to extract a domain name without the subdomain from a url using Python urlparse.
For example, I would like to extract "google.com" from a full url like "http://www.google.com".
The closest I can seem to come with urlparse is the netloc attribute, but that includes the subdomain, which in this example would be www.google.com.
I know that it is possible to write some custom string manipulation to turn www.google.com into google.com, but I want to avoid by-hand string transforms or regex in this task. (The reason for this is that I am not familiar enough with url formation rules to feel confident that I could consider every edge case required in writing a custom parsing function.)
Or, if urlparse can't do what I need, does anyone know any other Python url-parsing libraries that would?
To get the host from a URL in Python, the easiest way is to use the urlparse() function from the urllib.parse module and access the netloc attribute. Note that netloc is the full network location, so it includes any subdomain (and the port, if one is given). When working with URLs in Python, the ability to easily extract information about those URLs can be very valuable.
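As a minimal sketch of the netloc approach described above (the helper name get_netloc is only for illustration, and this assumes Python 3, where urlparse lives in urllib.parse):

```python
from urllib.parse import urlparse


def get_netloc(url):
    """Return the network location of a URL: the host, plus the port if one is given."""
    return urlparse(url).netloc


parsed = urlparse("http://www.Google.com:8080/search?q=python")

# netloc keeps the subdomain, the original casing, and the port.
print(parsed.netloc)

# hostname is lower-cased and has the port stripped, but still keeps the subdomain.
print(parsed.hostname)

print(get_netloc("http://www.google.com"))
```

Neither attribute removes the subdomain, which is why netloc alone does not answer the question.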
URLs can be extracted from a text file using a regular expression: the expression returns the text wherever it matches the pattern. Only the re module is needed for this.
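A rough sketch of that regular-expression approach follows. The pattern and the find_urls name are assumptions chosen for illustration; the pattern is deliberately simple and is not a complete URL grammar (for instance, it would keep punctuation that directly follows a URL).

```python
import re

# Simplified pattern (an assumption, not a full URL grammar):
# "http://" or "https://" followed by characters that are not
# whitespace, angle brackets, or quotes.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def find_urls(text):
    """Return every substring of text that matches URL_PATTERN, in order."""
    return URL_PATTERN.findall(text)


sample = "See http://www.google.com and https://forums.news.cnn.com/ for details."
print(find_urls(sample))
```

This only locates URLs in text; splitting the host into subdomain and domain is a separate step.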
You probably want to check out tldextract, a library designed to do this kind of thing.
It uses the Public Suffix List to get a decent split based on known suffixes (gTLDs as well as entries like co.uk), but do note that this is just a brute-force list, nothing special, so it can get out of date (although hopefully it's curated so as not to).
>>> import tldextract
>>> tldextract.extract('http://forums.news.cnn.com/')
ExtractResult(subdomain='forums.news', domain='cnn', suffix='com')
So in your case:
>>> extracted = tldextract.extract('http://www.google.com')
>>> "{}.{}".format(extracted.domain, extracted.suffix)
'google.com'
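To show why a suffix list is needed at all, here is a toy sketch of the lookup idea using only the standard library. It is not how tldextract is implemented: SAMPLE_SUFFIXES is a tiny hypothetical stand-in for the real Public Suffix List, registered_domain is an illustrative name, and the wildcard and exception rules of the real list are ignored.

```python
from urllib.parse import urlparse

# Hypothetical, hand-picked stand-in for the Public Suffix List.
# The real list has thousands of entries plus wildcard/exception rules.
SAMPLE_SUFFIXES = {"com", "org", "uk", "co.uk"}


def registered_domain(url, suffixes=SAMPLE_SUFFIXES):
    """Return the label just left of the longest matching suffix, joined with that suffix.

    Returns None if the URL has no host or no suffix in the set matches.
    """
    host = urlparse(url).hostname
    if not host:
        return None
    labels = host.split(".")
    # Walk from the longest candidate suffix to the shortest,
    # so "co.uk" is preferred over "uk".
    for i in range(1, len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in suffixes:
            return ".".join(labels[i - 1:])
    return None


print(registered_domain("http://www.google.com"))
print(registered_domain("http://forums.news.cnn.com/"))
print(registered_domain("http://www.bbc.co.uk"))
```

The co.uk case is the reason "take the last two labels" is not a safe rule, and keeping the real list current is exactly the job a library like tldextract takes off your hands.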