
Extract domain name from URL in Python

I am trying to extract the domain names from a list of URLs, just like in https://stackoverflow.com/questions/18331948/extract-domain-name-from-the-url
My problem is that the URLs can be almost anything. A few examples:
m.google.com => google
m.docs.google.com => google
www.someisotericdomain.innersite.mall.co.uk => mall
www.ouruniversity.department.mit.ac.us => mit
www.somestrangeurl.shops.relevantdomain.net => relevantdomain
www.example.info => example
And so on.
The diversity of the domains doesn't allow me to use a regex as shown in "how to get domain name from URL": my script will run on an enormous number of URLs from real network traffic, so the regex would have to be enormous to catch all the kinds of domains mentioned above.
Unfortunately, my web research didn't turn up any efficient solution.
Does anyone have an idea of how to do this?
Any help will be appreciated!
Thank you

kobibo asked May 17 '17 10:05



3 Answers

Use tldextract. Unlike urlparse, which only splits a URL into its syntactic parts, tldextract accurately separates the gTLD or ccTLD (generic or country code top-level domain) from the registered domain and subdomains of a URL, using the Public Suffix List.

>>> import tldextract
>>> ext = tldextract.extract('http://forums.news.cnn.com/')
>>> ext
ExtractResult(subdomain='forums.news', domain='cnn', suffix='com')
>>> ext.domain
'cnn'
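
To illustrate why a suffix list handles cases like co.uk, here is a minimal standard-library sketch of the longest-suffix idea. It is not tldextract's implementation: SUFFIXES and registered_domain are hypothetical names, and the suffix set is a tiny hand-written sample chosen to cover the question's examples rather than the real Public Suffix List.

```python
# Tiny illustrative sample, NOT the real Public Suffix List
# (entries were picked only to cover the examples in the question).
SUFFIXES = {"com", "net", "info", "uk", "co.uk", "us", "ac.us"}


def registered_domain(host):
    """Return the label directly left of the longest known suffix, or ''."""
    labels = host.lower().strip(".").split(".")
    # Walk from the longest candidate suffix to the shortest,
    # so that 'co.uk' is matched before 'uk'.
    for i in range(1, len(labels)):
        if ".".join(labels[i:]) in SUFFIXES:
            return labels[i - 1]
    return ""


if __name__ == "__main__":
    for host in [
        "m.docs.google.com",
        "www.someisotericdomain.innersite.mall.co.uk",
        "www.example.info",
    ]:
        print(host, "=>", registered_domain(host))
```

With a complete suffix list the same loop would cover arbitrary hosts, which is roughly the service the library provides without you having to maintain the list yourself.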
akash karothiya answered Oct 10 '22 20:10


It seems you can use urlparse (https://docs.python.org/3/library/urllib.parse.html) on the URL and then read its netloc attribute.

From the netloc you can then extract the domain name with split, although a plain split cannot tell a multi-part suffix such as co.uk from the registered domain.
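
A minimal sketch of this approach, assuming the domain is always the second-to-last label (host_of and naive_domain are hypothetical helper names). Note that urlparse only fills in netloc when the URL has a scheme or starts with "//", so bare hosts such as m.google.com need a prefix first.

```python
from urllib.parse import urlparse


def host_of(url):
    """Return the lowercased host name of a URL, or '' if none is found."""
    # Without a scheme or leading '//', urlparse puts the host into .path.
    if "://" not in url and not url.startswith("//"):
        url = "//" + url
    return urlparse(url).hostname or ""


def naive_domain(url):
    """Return the second-to-last label of the host (assumes a one-label suffix)."""
    labels = host_of(url).split(".")
    return labels[-2] if len(labels) >= 2 else labels[0]


if __name__ == "__main__":
    print(naive_domain("http://forums.news.cnn.com/"))
    print(naive_domain("m.docs.google.com"))
    # Limitation: a two-label suffix breaks the assumption.
    print(naive_domain("www.mall.co.uk"))
```

The last call returns "co" rather than "mall", which is the case the question is worried about.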

Mariano Anaya answered Oct 10 '22 22:10


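Simple solution using string splitting (no regex is actually needed)

def domain_name(url):
    # Drop everything up to "www." and the scheme separator "//",
    # then take the first dot-separated label.
    # Note: this only works when the domain directly follows "www." or the
    # scheme; for "m.google.com" it returns "m".
    return url.split("www.")[-1].split("//")[-1].split(".")[0]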
Sharif O answered Oct 10 '22 21:10