I want to extract just the root domain name from the following subdomains and URLs in Python:
1) https://abc.example.com/dir
2) http://abc.example.abde?param=value
3) https://aaa.abc.example.zaddj?param=value
4) tcp://cces.example.com:5598
5) ccc.ddd.example.com
I have tried several methods, described below, but none of them works properly for all of these scenarios.
The first attempt, with urlparse, does not work because numbers 4 and 5 are not HTTP URLs and urlparse is not able to understand them. I used the following code:
    from urllib.parse import urlparse

    def get_root_domain_from_url(url):
        parsed_url = urlparse(url)
        domain_parts = parsed_url.netloc.split('.')
        if len(domain_parts) > 2:
            root_domain = domain_parts[-2] + '.' + domain_parts[-1]
        else:
            root_domain = parsed_url.netloc
        return root_domain
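For reference, the behaviour on inputs 4 and 5 can be reproduced directly. The sketch below first prints what urlparse returns for them, then shows one possible workaround. The helper name root_domain_via_urlparse is hypothetical, and the "last two labels" rule is an assumption that fits the five inputs above but not multi-part suffixes such as co.uk.

```python
from urllib.parse import urlparse

# What urlparse returns for the two problematic inputs.
with_port = urlparse("tcp://cces.example.com:5598")
no_scheme = urlparse("ccc.ddd.example.com")
print(repr(with_port.netloc))    # netloc still carries the port
print(repr(with_port.hostname))  # hostname is the host without the port
print(repr(no_scheme.netloc))    # empty, because there is no scheme or '//'
print(repr(no_scheme.path))      # the whole input ends up in path


def root_domain_via_urlparse(url):
    """Hypothetical helper: return the last two labels of the host."""
    # Assumption: an input without '://' is a bare host; prefixing '//'
    # makes urlparse treat it as a network location instead of a path.
    if "://" not in url:
        url = "//" + url
    host = urlparse(url).hostname or ""
    labels = host.split(".")
    # Assumption: the root domain is always the last two labels.
    return ".".join(labels[-2:]) if len(labels) >= 2 else host
```

Using hostname rather than netloc drops the port, and the '//' prefix covers the scheme-less case, so no regex is needed for these five inputs.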
The second attempt, with tldextract, does not work because numbers 2 and 3 do not use valid TLDs, so tldextract does not return valid results. I used the following code:
    import tldextract

    def get_root_domain_from_url(url):
        extracted = tldextract.extract(url)
        if extracted.subdomain:
            root_domain = extracted.subdomain[:-1] + '.' + extracted.domain + '.' + extracted.suffix
        else:
            root_domain = extracted.domain + '.' + extracted.suffix
        return root_domain
It does not work properly, as the URL scheme may change in every scenario. I am also a little weak at writing regexes for such things.
I want the following results:
https://abc.example.com/dir -> example.com
http://abc.example.abde?param=value -> example.abde
https://aaa.abc.example.zaddj?param=value -> example.zaddj
tcp://cces.example.com:5598 -> example.com
ccc.ddd.example.com -> example.com
I would use a regex approach with re.search and group, this way:
    import re

    def get_root_domain_from_url(url):
        p = r"(?:\w+://)?(?:\w+\.)*(\w+\.[a-z]+)(?:[/?:]?.*)"
        # group(1) is the captured "domain.tld" part
        root_domain = re.search(p, url).group(1)
        return root_domain
(?:\w+://)? : Matches an optional protocol specifier without capturing it
(?:\w+\.)* : Matches zero or more subdomains without capturing them
(\w+\.[a-z]+) : Matches the domain name (e.g., example.com) and captures it
(?:[/?:]?.*) : Matches an optional path after the domain without capturing it
Test/Output :
    # list_urls holds the five inputs from the question
    for url in list_urls:
        print(f"{url} -> {get_root_domain_from_url(url)}")
https://abc.example.com/dir -> example.com
http://abc.example.abde?param=value -> example.abde
https://aaa.abc.example.zaddj?param=value -> example.zaddj
tcp://cces.example.com:5598 -> example.com
ccc.ddd.example.com -> example.com
Demo : [Regex101]
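Two edge cases are worth keeping in mind with the pattern above: \w does not match a hyphen, so a root domain such as my-site.com is cut down to site.com, and re.search returns None when nothing matches, so the .group(1) call raises AttributeError. A minimal sketch of a stricter variant follows. The name get_root_domain_or_none is hypothetical, and the pattern still assumes the root domain is the last two labels with a letters-only final label.

```python
import re

# Assumption: host labels may contain hyphens, and the final label is letters only.
_ROOT_DOMAIN = re.compile(
    r"^(?:[a-z][a-z0-9+.-]*://)?"  # optional scheme such as https:// or tcp://
    r"(?:[\w-]+\.)*"               # zero or more subdomain labels, not captured
    r"([\w-]+\.[a-z]+)"            # captured: the last two labels of the host
    r"(?=[/?:#]|$)",               # host must end at a path, query, port or end of input
    re.IGNORECASE,
)


def get_root_domain_or_none(url):
    """Hypothetical helper: return the root domain, or None if nothing matches."""
    match = _ROOT_DOMAIN.search(url)
    return match.group(1) if match else None
```

Anchoring at the start and requiring a delimiter after the host keeps the search from silently matching a fragment further along in the string.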