I want to extract just the root domain name from the following subdomains and URLs in Python:
1) https://abc.example.com/dir
2) http://abc.example.abde?param=value
3) https://aaa.abc.example.zaddj?param=value
4) tcp://cces.example.com:5598
5) ccc.ddd.example.com
I have tried several methods, described below, but none of them works properly for all of these scenarios.
The first does not work because numbers 4 and 5 are non-HTTP URLs, which urlparse is not able to understand. I used the following code:
from urllib.parse import urlparse
def get_root_domain_from_url(url):
    parsed_url = urlparse(url)
    domain_parts = parsed_url.netloc.split('.')
    if len(domain_parts) > 2:
        root_domain = domain_parts[-2] + '.' + domain_parts[-1]
    else:
        root_domain = parsed_url.netloc
    return root_domain
The second does not work because numbers 2 and 3 do not use valid TLDs, so tldextract does not return valid results. I used the following code:
import tldextract
def get_root_domain_from_url(url):
    extracted = tldextract.extract(url)
    if extracted.subdomain:
        root_domain = extracted.subdomain[:-1] + '.' + extracted.domain + '.' + extracted.suffix
    else:
        root_domain = extracted.domain + '.' + extracted.suffix
    return root_domain
That does not work properly either, because the URL scheme may differ in every scenario. I am also not very good at writing regexes for such things.
I want the following results:
https://abc.example.com/dir -> example.com
http://abc.example.abde?param=value -> example.abde
https://aaa.abc.example.zaddj?param=value -> example.zaddj
tcp://cces.example.com:5598 -> example.com
ccc.ddd.example.com -> example.com
I would use a regex approach with re.search and a capturing group, like this:
import re

def get_root_domain_from_url(url):
    p = r"(?:\w+://)?(?:\w+\.)*(\w+\.[a-z]+)(?:[/?:]?.*)"
    root_domain = re.search(p, url).group(1)
    return root_domain
(?:\w+://)?    : Matches an optional protocol specifier without capturing it
(?:\w+\.)*     : Matches zero or more subdomains without capturing them
(\w+\.[a-z]+)  : Matches the domain name (e.g., example.com) and captures it
(?:[/?:]?.*)   : Matches an optional path after the domain without capturing it
Test/Output:
for url in list_urls:
    print(f"{url} -> {get_root_domain_from_url(url)}")
https://abc.example.com/dir -> example.com
http://abc.example.abde?param=value -> example.abde
https://aaa.abc.example.zaddj?param=value -> example.zaddj
tcp://cces.example.com:5598 -> example.com
ccc.ddd.example.com -> example.com
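One thing to watch for: re.search returns None when nothing matches, so calling .group(1) directly raises AttributeError on input such as a bare hostname with no dot. Here is a minimal, self-contained sketch of a guarded variant. The names ROOT_DOMAIN_RE and get_root_domain_or_none are hypothetical, and list_urls is simply the five inputs from the question.

```python
import re

# Same pattern as in the answer, compiled once for reuse.
ROOT_DOMAIN_RE = re.compile(r"(?:\w+://)?(?:\w+\.)*(\w+\.[a-z]+)(?:[/?:]?.*)")


def get_root_domain_or_none(url):
    """Return the captured 'name.tld' part, or None if the pattern does not match."""
    match = ROOT_DOMAIN_RE.search(url)
    return match.group(1) if match else None


# The five inputs listed in the question.
list_urls = [
    "https://abc.example.com/dir",
    "http://abc.example.abde?param=value",
    "https://aaa.abc.example.zaddj?param=value",
    "tcp://cces.example.com:5598",
    "ccc.ddd.example.com",
]

for url in list_urls:
    print(f"{url} -> {get_root_domain_or_none(url)}")
```

Note that \w does not match a hyphen and [a-z]+ does not match uppercase letters, so hosts containing hyphens or uppercase TLDs may need a wider character class.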
Demo : [Regex101]
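If you would rather avoid regex, the urlparse attempt from the question can probably be rescued with two small changes. This is only a sketch under assumptions: the function name root_domain_via_urlparse is hypothetical, and it assumes the root domain is always the last two dot-separated labels, as in the expected results above. urlparse only fills in the network location when the string contains "//", so the sketch prepends it for scheme-less input, and it reads .hostname rather than .netloc so that the port is dropped.

```python
from urllib.parse import urlparse


def root_domain_via_urlparse(url):
    """Return the last two labels of the host, or None if no host is found."""
    # Without "//", urlparse treats a bare host as a path and leaves netloc empty.
    if "://" not in url:
        url = "//" + url
    # .hostname is lowercased and excludes any port or credentials.
    host = urlparse(url).hostname
    if not host:
        return None
    # Assumption: the root domain is the last two labels of the host.
    return ".".join(host.split(".")[-2:])


print(root_domain_via_urlparse("tcp://cces.example.com:5598"))
print(root_domain_via_urlparse("ccc.ddd.example.com"))
```

Because it keeps only the last two labels, this would give the wrong answer for multi-part public suffixes such as co.uk; that case needs a suffix list like the one tldextract uses.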