
How to extract the root domain name from several kinds of URLs (subdomains, URLs with port numbers, etc.) in Python?

I want to extract just the root domain name from the following URLs and hostnames in Python.

1) https://abc.example.com/dir
2) http://abc.example.abde?param=value
3) https://aaa.abc.example.zaddj?param=value
4) tcp://cces.example.com:5598
5) ccc.ddd.example.com

I have tried several methods, described below, but none of them works for all of these scenarios.

Method 1 tried: Using the Python library urllib.parse (urlparse)

This does not work for numbers 4 and 5, which are not HTTP URLs. With the code below, number 4 keeps its port in netloc (giving example.com:5598), and number 5 has no scheme, so netloc is empty. I used the following code:

from urllib.parse import urlparse

def get_root_domain_from_url(url):
    parsed_url = urlparse(url)
    domain_parts = parsed_url.netloc.split('.')
    if len(domain_parts) > 2:
        root_domain = domain_parts[-2] + '.' + domain_parts[-1]
    else:
        root_domain = parsed_url.netloc
    return root_domain
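
For reference, urlparse can cope with inputs 4 and 5 if the code reads hostname instead of netloc and prefixes scheme-less input with //. The following is only a minimal sketch under those assumptions: the function name root_domain_via_urlparse is hypothetical (not from the original post), and it keeps the same "last two labels" heuristic as the code above, so it would give the wrong answer for multi-label suffixes such as co.uk.

```python
from urllib.parse import urlparse

def root_domain_via_urlparse(url):
    # Hypothetical helper, not part of the original post.
    # Without a leading "//", urlparse puts a bare hostname into .path, not .netloc.
    if "://" not in url:
        url = "//" + url
    # .hostname drops any port and userinfo and lower-cases the host.
    host = urlparse(url).hostname or ""
    labels = host.split(".")
    # Assumption: the root domain is simply the last two labels of the host.
    return ".".join(labels[-2:]) if len(labels) >= 2 else host
```

Because no suffix list is consulted, unknown endings such as .abde or .zaddj are treated like any other final label.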

Method 2 tried: Using the Python library tldextract

This does not work for numbers 2 and 3: their suffixes (.abde and .zaddj) are not recognized TLDs, so tldextract does not return valid results. I used the following code:

import tldextract

def get_root_domain_from_url(url):
    extracted = tldextract.extract(url)
    # The root domain is domain + suffix; the subdomain is not part of it.
    # For a suffix tldextract does not recognize, extracted.suffix is empty.
    root_domain = extracted.domain + '.' + extracted.suffix
    return root_domain

Method 3 tried: Using a Python regex.

This does not work reliably because the URL scheme may differ in every scenario. I am also not very good at writing regexes for this kind of thing.

I want the following results:

https://abc.example.com/dir -> example.com
http://abc.example.abde?param=value -> example.abde
https://aaa.abc.example.zaddj?param=value -> example.zaddj
tcp://cces.example.com:5598 -> example.com
ccc.ddd.example.com -> example.com

J Jogal asked Sep 16 '25 21:09



1 Answer

I would use a regex approach with search/group, like this:

import re

def get_root_domain_from_url(url):
    # Optional scheme, any number of subdomains, then the captured "name.tld" pair
    p = r"(?:\w+://)?(?:\w+\.)*(\w+\.[a-z]+)(?:[/?:]?.*)"
    root_domain = re.search(p, url).group(1)
    return root_domain
  • (?:\w+://)?     : Matches an optional protocol specifier without capturing it
  • (?:\w+\.)*       : Matches zero or more subdomains without capturing them
  • (\w+\.[a-z]+)  : Matches the domain name (e.g., example.com) and captures it
  • (?:[/?:]?.*)    : Matches whatever follows the domain (path, query string, or port) without capturing it

Test/Output :

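list_urls = ["https://abc.example.com/dir", "http://abc.example.abde?param=value",
             "https://aaa.abc.example.zaddj?param=value", "tcp://cces.example.com:5598", "ccc.ddd.example.com"]
for url in list_urls:
    print(f"{url} -> {get_root_domain_from_url(url)}")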

https://abc.example.com/dir -> example.com
http://abc.example.abde?param=value -> example.abde
https://aaa.abc.example.zaddj?param=value -> example.zaddj
tcp://cces.example.com:5598 -> example.com
ccc.ddd.example.com -> example.com
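
Two caveats with the pattern as written: re.search returns None when nothing matches, so calling .group(1) on the result raises AttributeError, and \w does not match hyphens, so hyphenated labels are cut short. A slightly more defensive variant might look like the sketch below. It is an illustration under my own assumptions, not a definitive implementation: the names _ROOT_DOMAIN and get_root_domain_or_none are hypothetical, and like the original it still treats the last two labels as the root domain.

```python
import re

# Hypothetical variant of the pattern above (not the original answer's code):
# labels may contain hyphens, matching is case-insensitive, and the captured
# domain must be followed by ':', '/', '?', '#' or the end of the string.
_ROOT_DOMAIN = re.compile(
    r"^(?:[a-z][a-z0-9+.-]*://)?(?:[\w-]+\.)*([\w-]+\.[a-z]+)(?=[:/?#]|$)",
    re.IGNORECASE,
)

def get_root_domain_or_none(url):
    # Return None instead of raising when the input does not look like a
    # domain name (for example, a bare IPv4 address).
    match = _ROOT_DOMAIN.match(url)
    return match.group(1).lower() if match else None
```

Callers can then check for None rather than wrapping the call in try/except.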

Demo : [Regex101]

Timeless answered Sep 19 '25 10:09
