
Extract 2nd level domain from domain? - Python

I have a list of domains e.g.

  • site.co.uk

  • site.com

  • site.me.uk

  • site.jpn.com

  • site.org.uk

  • site.it

The domain names can also contain 3rd- and 4th-level domains, e.g.

  • test.example.site.org.uk

  • test2.site.com

I need to extract the 2nd-level domain, which in all these cases is site.


Any ideas? :)

RadiantHex asked Feb 06 '11 23:02


3 Answers

There is no way to reliably get that from the string alone. Subdomains are arbitrary, and the list of domain extensions is huge and keeps growing. The best you can do is check each name against that list and keep your copy of the list up to date.

list: http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1
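
To illustrate why the list is needed, here is a minimal sketch that compares a naive split with a lookup against a suffix list. The names `KNOWN_SUFFIXES`, `naive_label` and `listed_label` are made up for this example, and the suffix set is hand-picked to cover only the domains in the question; a real list is far longer.

```python
# Hand-picked suffixes covering only the question's examples (an assumption
# for illustration; a real list has thousands of entries and changes often).
KNOWN_SUFFIXES = {"com", "it", "uk", "co.uk", "me.uk", "org.uk", "jpn.com"}


def naive_label(domain):
    """Assume the suffix is always exactly one label long."""
    return domain.split(".")[-2]


def listed_label(domain, suffixes=KNOWN_SUFFIXES):
    """Return the label just left of the longest listed suffix, or None."""
    labels = domain.split(".")
    # start=1 leaves at least one label in front of the candidate suffix;
    # scanning left to right tries the longest candidate first.
    for start in range(1, len(labels)):
        if ".".join(labels[start:]) in suffixes:
            return labels[start - 1]
    return None


for domain in ["site.com", "site.co.uk", "test.example.site.org.uk"]:
    print(domain, naive_label(domain), listed_label(domain))
```

The naive version only works for single-label extensions such as `.com`; for `site.co.uk` it picks `co`, while the list-based lookup picks `site`.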

Crayon Violent answered Sep 20 '22 17:09


Following @kohlehydrat's suggestion:

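# Python 3 version (the original used urllib2 and print statements)
from urllib.request import urlopen


class TldMatcher(object):
    # use class vars for lazy loading
    # if this URL is unreachable, pass another copy of the list to loadTlds(),
    # e.g. https://publicsuffix.org/list/public_suffix_list.dat
    MASTERURL = "http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1"
    TLDS = None

    @classmethod
    def loadTlds(cls, url=None):
        url = url or cls.MASTERURL

        # grab master list (urlopen returns bytes, so decode before splitting)
        with urlopen(url) as response:
            lines = response.read().decode('utf-8').splitlines()

        # strip comments and blank lines
        lines = [ln for ln in (ln.strip() for ln in lines) if ln and not ln.startswith('//')]

        # note: exception rules (lines starting with '!') are stored as-is and never match
        cls.TLDS = set(lines)

    def __init__(self):
        if TldMatcher.TLDS is None:
            TldMatcher.loadTlds()

    def getTld(self, url):
        # returns the longest matching suffix, or None if nothing matches
        best_match = None
        chunks = url.split('.')

        # walk from the last label towards the first, so longer matches overwrite shorter ones
        for start in range(len(chunks)-1, -1, -1):
            test = '.'.join(chunks[start:])
            # wildcard form: '*' stands in for the leftmost label of the candidate
            startest = '.'.join(['*']+chunks[start+1:])

            if test in TldMatcher.TLDS or startest in TldMatcher.TLDS:
                best_match = test

        return best_match

    def get2ld(self, url):
        tld = self.getTld(url)
        if tld is None:
            raise ValueError("no known TLD in '{0}'".format(url))

        urls = url.split('.')
        tlds = tld.split('.')
        if len(urls) <= len(tlds):
            raise ValueError("'{0}' has no label in front of its TLD".format(url))
        return urls[-1 - len(tlds)]


def test_TldMatcher():
    matcher = TldMatcher()

    test_urls = [
        'site.co.uk',
        'site.com',
        'site.me.uk',
        'site.jpn.com',
        'site.org.uk',
        'site.it'
    ]

    errors = 0
    for u in test_urls:
        res = matcher.get2ld(u)
        if res != 'site':
            print("Error: found '{0}', should be 'site'".format(res))
            errors += 1

    if errors == 0:
        print("Passed!")
    return (errors == 0)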
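
One gap in the matcher above: the list also contains exception rules, written with a leading `!`, and since every line is stored as a literal string those rules never match. The following is a rough offline sketch of how wildcard and exception rules could be handled together. `SAMPLE_RULES`, `parse_rules`, `public_suffix` and `label_before_suffix` are hypothetical names for this example, the rule sample is a tiny inline excerpt modelled on entries in the public list (`*.ck` and `!www.ck`), and the matching is a simplification rather than a full implementation of the list's algorithm.

```python
# Tiny inline sample in the list's rule syntax (an assumption for
# illustration; the real file is far longer).
SAMPLE_RULES = """
// comment lines start with two slashes
com
uk
co.uk
*.ck
!www.ck
"""


def parse_rules(text):
    """Split rule text into (plain and wildcard rules, exception rules)."""
    rules, exceptions = set(), set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("//"):
            continue
        if line.startswith("!"):
            exceptions.add(line[1:])  # store without the leading '!'
        else:
            rules.add(line)
    return rules, exceptions


def public_suffix(domain, rules, exceptions):
    """Return the suffix of `domain` selected by the given rules."""
    labels = domain.lower().split(".")
    # Try the longest candidate first so the most specific rule wins.
    for start in range(len(labels)):
        candidate = ".".join(labels[start:])
        if candidate in exceptions:
            # An exception rule makes the suffix one label shorter.
            return ".".join(labels[start + 1:])
        wildcard = ".".join(["*"] + labels[start + 1:])
        if candidate in rules or wildcard in rules:
            return candidate
    # No rule matched: fall back to the last label.
    return labels[-1]


def label_before_suffix(domain, rules, exceptions):
    """Return the label just left of the suffix, or None if there is none."""
    labels = domain.lower().split(".")
    suffix_len = len(public_suffix(domain, rules, exceptions).split("."))
    if len(labels) <= suffix_len:
        return None
    return labels[-1 - suffix_len]


rules, exceptions = parse_rules(SAMPLE_RULES)
for name in ["test.example.site.co.uk", "foo.bar.ck", "www.ck"]:
    print(name, label_before_suffix(name, rules, exceptions))
```

With this sample, `foo.bar.ck` falls under the wildcard rule (suffix `bar.ck`), while `www.ck` is carved out by the exception rule (suffix `ck`).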

Hugh Bothwell answered Sep 17 '22 17:09


Using the Python tld package:

https://pypi.python.org/pypi/tld

$ pip install tld

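from tld import get_tld, get_fld

print(get_tld("http://www.google.co.uk"))
# co.uk

print(get_fld("http://www.google.co.uk"))
# google.co.uk

# as_object=True returns a result object; its .domain is the label the question asks for
res = get_tld("http://www.google.co.uk", as_object=True)
print(res.domain)
# google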

Artur Barseghyan answered Sep 16 '22 17:09