Get subdomain from URL using Python

Tags: python, string, url

For example, the address is:

Address = http://lol1.domain.com:8888/some/page

I want to save the subdomain into a variable so I can do something like this:

print SubAddr
>> lol1

asked Aug 03 '11 11:08

Marko

4 Answers

urlparse.urlparse will split the URL into scheme, network location (host and port), path, and so on. You can then split the host name on . and take the first label as the subdomain.

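import urlparse

address = 'http://lol1.domain.com:8888/some/page'
url = urlparse.urlparse(address)
# hostname excludes the port, so this yields 'lol1'
subdomain = url.hostname.split('.')[0]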
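
On Python 3 the same function lives in urllib.parse. The following is a minimal sketch, assuming the subdomain is simply the first label of the host name; the helper name first_label is made up for illustration and is not part of the answer above.

```python
from urllib.parse import urlparse


def first_label(address):
    """Return the first dot-separated label of the URL's host, or None."""
    # hostname drops the port; it is None when the URL has no network
    # location at all (for example, a bare path).
    hostname = urlparse(address).hostname
    if hostname is None:
        return None
    return hostname.split('.')[0]


print(first_label('http://lol1.domain.com:8888/some/page'))  # lol1
```

Returning None instead of raising is a design choice for this sketch; raising ValueError would be just as reasonable.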

answered Sep 21 '22 08:09

Daniel Roseman

The tldextract package makes this task very easy, and you can still use urlparse as suggested if you need any further information:

>>> import tldextract
>>> import urlparse
>>> tldextract.extract("http://lol1.domain.com:8888/some/page")
ExtractResult(subdomain='lol1', domain='domain', suffix='com')
>>> tldextract.extract("http://sub.lol1.domain.com:8888/some/page")
ExtractResult(subdomain='sub.lol1', domain='domain', suffix='com')
>>> urlparse.urlparse("http://sub.lol1.domain.com:8888/some/page")
ParseResult(scheme='http', netloc='sub.lol1.domain.com:8888', path='/some/page', params='', query='', fragment='')

Note that tldextract handles nested sub-domains properly: in the second example it returns sub.lol1, not just sub.
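
To see why taking the first label is not enough for nested sub-domains, here is a minimal sketch that does not use tldextract at all. It assumes you already know the registered domain (domain.com in the question), which is exactly the part tldextract works out for you. The helper name subdomain_of is hypothetical.

```python
from urllib.parse import urlparse


def subdomain_of(address, registered_domain):
    """Return everything left of a known registered domain ('' if none)."""
    hostname = urlparse(address).hostname or ''
    if hostname == registered_domain:
        return ''
    suffix = '.' + registered_domain
    if not hostname.endswith(suffix):
        raise ValueError('%r is not under %r' % (hostname, registered_domain))
    # Strip '.domain.com' from the right-hand side of the host name.
    return hostname[:-len(suffix)]


url = 'http://sub.lol1.domain.com:8888/some/page'
print(urlparse(url).hostname.split('.')[0])  # sub
print(subdomain_of(url, 'domain.com'))       # sub.lol1
```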

answered Sep 18 '22 08:09

Lluís Vilanova

A very basic approach, without any sanity checking, could look like:

address = 'http://lol1.domain.com:8888/some/page'

host = address.partition('://')[2]
sub_addr = host.partition('.')[0]

print sub_addr

This of course assumes that when you say 'subdomain' you mean the first part of a host name, so in the following case, 'www' would be the subdomain:

http://www.google.com/

Is that what you mean?
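
For reference, a Python 3 sketch of the same string-only idea with a little sanity checking added. It assumes the same meaning of 'subdomain' (the first label of the host name) and deliberately ignores user info (user@host) and IPv6 literals; the function name first_host_label is invented for this example.

```python
def first_host_label(address):
    """First label of the host, using only string methods."""
    # partition returns (head, sep, tail); sep is '' when '://' is absent,
    # in which case the string is treated as starting with the host.
    head, sep, tail = address.partition('://')
    rest = tail if sep else head
    # Cut the host off at the first '/', then drop any ':port'.
    host = rest.partition('/')[0].partition(':')[0]
    return host.partition('.')[0]


print(first_host_label('http://lol1.domain.com:8888/some/page'))  # lol1
print(first_host_label('www.google.com/'))                        # www
```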

answered Sep 19 '22 08:09

Steve Mayne

Modified version of the fantastic answer here: How to extract top-level domain name (TLD) from URL

You will need the list of effective TLDs, saved locally as effective_tld_names.dat.txt (the file name the code below opens).

from __future__ import with_statement
from urlparse import urlparse

# load tlds, ignore comments and empty lines:
with open("effective_tld_names.dat.txt") as tldFile:
    tlds = [line.strip() for line in tldFile if line[0] not in "/\n"]

class DomainParts(object):
    def __init__(self, domain_parts, tld):
        self.domain = None
        self.subdomains = None
        self.tld = tld
        if domain_parts:
            self.domain = domain_parts[-1]
            if len(domain_parts) > 1:
                self.subdomains = domain_parts[:-1]

def get_domain_parts(url, tlds):
    urlElements = urlparse(url).hostname.split('.')
    # urlElements = ["abcde","co","uk"]
    for i in range(-len(urlElements),0):
        lastIElements = urlElements[i:]
        #    i=-3: ["abcde","co","uk"]
        #    i=-2: ["co","uk"]
        #    i=-1: ["uk"] etc

        candidate = ".".join(lastIElements) # abcde.co.uk, co.uk, uk
        wildcardCandidate = ".".join(["*"]+lastIElements[1:]) # *.co.uk, *.uk, *
        exceptionCandidate = "!"+candidate

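        # match tlds:
        if (exceptionCandidate in tlds):
            # an exception rule ("!x.y.z") means the TLD is one label shorter (y.z)
            return DomainParts(urlElements[:i+1], '.'.join(urlElements[i+1:]))
        if (candidate in tlds or wildcardCandidate in tlds):
            return DomainParts(urlElements[:i], '.'.join(urlElements[i:]))
            # e.g. DomainParts(["abcde"], "co.uk")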

    raise ValueError("Domain not in global list of TLDs")

domain_parts = get_domain_parts("http://sub2.sub1.example.co.uk:80",tlds)
print "Domain:", domain_parts.domain
print "Subdomains:", domain_parts.subdomains or "None"
print "TLD:", domain_parts.tld

Gives you:

Domain: example
Subdomains: ['sub2', 'sub1']
TLD: co.uk
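
The matching loop above can also be sketched in Python 3 without the data file. This is a simplified illustration, not a full implementation of the suffix-list rules: RULES is a tiny hand-picked stand-in for the real list, and the name split_host and the tuple return shape are assumptions made for this example.

```python
from urllib.parse import urlparse

# Tiny stand-in for the real suffix list (an assumption for this demo):
# plain rules, one wildcard rule, and one '!' exception to that wildcard.
RULES = {'com', 'uk', 'co.uk', 'jp', '*.kawasaki.jp', '!city.kawasaki.jp'}


def split_host(url, rules=RULES):
    """Return (subdomains, domain, suffix) for the URL's host name."""
    labels = urlparse(url).hostname.split('.')
    # Try the longest candidate suffix first, then shorter ones.
    for i in range(len(labels)):
        candidate = '.'.join(labels[i:])
        wildcard = '.'.join(['*'] + labels[i + 1:])
        if '!' + candidate in rules:
            # Exception rule: the real suffix is one label shorter.
            i += 1
        elif candidate not in rules and wildcard not in rules:
            continue
        rest = labels[:i]
        if not rest:
            raise ValueError('host is itself a public suffix')
        return rest[:-1], rest[-1], '.'.join(labels[i:])
    raise ValueError('no matching suffix rule')


print(split_host('http://sub2.sub1.example.co.uk:80'))
# (['sub2', 'sub1'], 'example', 'co.uk')
```

Using a set for the rules keeps each membership test constant-time, which matters once the full list is loaded.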

answered Sep 19 '22 08:09

Acorn
