
Python urlparse -- extract domain name without subdomain

I need a way to extract a domain name without the subdomain from a URL using Python's urlparse.

For example, I would like to extract "google.com" from a full url like "http://www.google.com".

The closest I can seem to come with urlparse is the netloc attribute, but that includes the subdomain, which in this example would be www.google.com.
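For reference, a minimal sketch of that behaviour (assuming Python 3, where urlparse lives in urllib.parse; in Python 2 it was the standalone urlparse module):

```python
from urllib.parse import urlparse

# Parse the example URL from the question.
parsed = urlparse("http://www.google.com")

# netloc is the full network location, subdomain included.
print(parsed.netloc)    # www.google.com

# hostname is similar but lowercased and without any port or credentials;
# it still contains the "www" subdomain.
print(parsed.hostname)  # www.google.com
```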

I know that it is possible to write some custom string manipulation to turn www.google.com into google.com, but I want to avoid by-hand string transforms or regex in this task. (The reason for this is that I am not familiar enough with url formation rules to feel confident that I could consider every edge case required in writing a custom parsing function.)

Or, if urlparse can't do what I need, does anyone know any other Python url-parsing libraries that would?

Clay Wardell asked Jan 18 '13 19:01





1 Answer

You probably want to check out tldextract, a library designed to do this kind of thing.

It uses the Public Suffix List to split a hostname at a known public suffix, which covers gTLDs as well as multi-label suffixes such as co.uk. Do note that this is just a lookup list, nothing special, so it can get out of date (although hopefully it's curated so as not to).

>>> import tldextract
>>> tldextract.extract('http://forums.news.cnn.com/')
ExtractResult(subdomain='forums.news', domain='cnn', suffix='com')

So in your case:

>>> extracted = tldextract.extract('http://www.google.com')
>>> "{}.{}".format(extracted.domain, extracted.suffix)
'google.com'
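To show why a suffix list is needed at all, here is a rough standard-library sketch of the same idea. It is not how tldextract is implemented: `SAMPLE_SUFFIXES` and `registered_domain` are hypothetical names made up for this illustration, and the suffix set is a tiny hand-picked sample rather than the real Public Suffix List.

```python
from urllib.parse import urlparse

# Hypothetical, hand-picked sample of public suffixes, for illustration only.
# The real Public Suffix List has thousands of entries plus wildcard and
# exception rules, all of which this sketch ignores.
SAMPLE_SUFFIXES = {"com", "org", "uk", "co.uk"}


def registered_domain(url, suffixes=SAMPLE_SUFFIXES):
    """Return the domain plus its public suffix, or None if nothing matches."""
    host = urlparse(url).hostname
    if not host:
        return None
    labels = host.lower().split(".")
    # Try the longest candidate suffix first so "co.uk" wins over "uk".
    for i in range(1, len(labels)):
        if ".".join(labels[i:]) in suffixes:
            # Keep exactly one label in front of the matched suffix.
            return ".".join(labels[i - 1:])
    return None


print(registered_domain("http://www.google.com"))     # google.com
print(registered_domain("http://forums.bbc.co.uk/"))  # bbc.co.uk
```

Simply taking the last two labels would turn the second URL into "co.uk", which is the kind of edge case the list exists to handle. For real use, prefer tldextract, since it works from the full list.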
Gareth Latty answered Sep 16 '22 14:09
