I have a list of domains, e.g.:
site.co.uk
site.com
site.me.uk
site.jpn.com
site.org.uk
site.it
The domain names can also contain 3rd- and 4th-level domains, e.g.:
test.example.site.org.uk
test2.site.com
I need to extract the 2nd-level domain, which in all these cases is site.
Any ideas? :)
There is no way to do that reliably from the name alone. Subdomains are arbitrary, and there is a monster list of domain extensions (public suffixes) that grows every day. The best you can do is check against that list and keep your copy of it up to date.
list: http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1
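For reference, that file is a plain-text list of suffix rules: comment lines start with //, most rules are literal suffixes, and some are wildcards. A few illustrative entries (taken from the list's well-known format) look roughly like this:

// comment lines start with two slashes
uk
co.uk
org.uk
*.ck

The code in the next answer strips the // comments and checks both the literal and the wildcard form of each candidate suffix.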
Following @kohlehydrat's suggestion:
import urllib.request

class TldMatcher(object):
    # use class vars for lazy loading
    MASTERURL = "http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1"
    TLDS = None

    @classmethod
    def loadTlds(cls, url=None):
        url = url or cls.MASTERURL
        # grab master list
        lines = urllib.request.urlopen(url).read().decode('utf-8').splitlines()
        # strip comments and blank lines
        lines = [ln for ln in (ln.strip() for ln in lines) if ln and ln[:2] != '//']
        cls.TLDS = set(lines)

    def __init__(self):
        if TldMatcher.TLDS is None:
            TldMatcher.loadTlds()

    def getTld(self, url):
        # test suffixes from shortest to longest, keeping the longest
        # one that appears in the list (literally or as a wildcard rule)
        best_match = None
        chunks = url.split('.')
        for start in range(len(chunks) - 1, -1, -1):
            test = '.'.join(chunks[start:])
            startest = '.'.join(['*'] + chunks[start + 1:])
            if test in TldMatcher.TLDS or startest in TldMatcher.TLDS:
                best_match = test
        return best_match

    def get2ld(self, url):
        # the label immediately to the left of the matched public suffix
        urls = url.split('.')
        tlds = self.getTld(url).split('.')
        return urls[-1 - len(tlds)]


def test_TldMatcher():
    matcher = TldMatcher()

    test_urls = [
        'site.co.uk',
        'site.com',
        'site.me.uk',
        'site.jpn.com',
        'site.org.uk',
        'site.it'
    ]

    errors = 0
    for u in test_urls:
        res = matcher.get2ld(u)
        if res != 'site':
            print("Error: found '{0}', should be 'site'".format(res))
            errors += 1

    if errors == 0:
        print("Passed!")
    return errors == 0
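A quick sanity check against the deeper examples from the question (assuming the suffix list downloads successfully):

matcher = TldMatcher()
print(matcher.get2ld('test.example.site.org.uk'))   # site
print(matcher.get2ld('test2.site.com'))             # site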
Using the Python tld package
https://pypi.python.org/pypi/tld
$ pip install tld
from tld import get_tld, get_fld

print(get_tld("http://www.google.co.uk"))   # co.uk
print(get_fld("http://www.google.co.uk"))   # google.co.uk
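To pull out just the site label from the question's examples, one option is to take the first label of what get_fld returns (a minimal sketch; get_fld is given a full URL here, matching the usage above):

from tld import get_fld

print(get_fld("http://test.example.site.org.uk").split('.')[0])   # site
print(get_fld("http://test2.site.com").split('.')[0])             # site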