Get the second-level domain of a URL (Java)

Tags: java, url

I am wondering if there is a parser or library in Java for extracting the second-level domain (SLD) of a URL, or failing that, an algorithm or regex for doing the same. For example:

URI uri = new URI("http://www.mydomain.ltd.uk/blah/some/page.html");
String host = uri.getHost();
System.out.println(host);

which prints:

www.mydomain.ltd.uk

Now what I'd like to do is robustly identify the SLD ("ltd.uk") component. Any ideas?

Edit: I'm ideally looking for a general solution, so I'd match ".uk" in "police.uk", ".co.uk" in "bbc.co.uk" and ".com" in "amazon.com".

Thanks

Richard H, asked Dec 17 '09 18:12


3 Answers

After reading everything here, the correct solution should be (with Guava):

InternetDomainName.from(uriHost).topPrivateDomain().toString();

See also the related question "errors when using Guava to get the private domain name".

user85155, answered Oct 27 '22 00:10


I don't know your purpose, but the second-level domain as such may not be what you need. You probably want to find the public suffix; the domain directly below it is what you are looking for.

Apache HttpComponents (HttpClient 4) comes with classes to handle this:

org.apache.http.impl.cookie.PublicSuffixFilter
org.apache.http.impl.cookie.PublicSuffixListParser

You need to download the public suffix list from here:

http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1
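To make the "public suffix plus the domain right below it" idea concrete, here is a minimal sketch in plain Java. It does not use the HttpClient classes above. The class name PublicSuffixSketch, its methods, and the four-entry SAMPLE_SUFFIXES set are all made up for illustration; a real program would load the full public suffix list, which also contains wildcard and exception rules that this sketch ignores.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class PublicSuffixSketch {

    // Tiny hand-picked sample, for illustration only (an assumption,
    // not the real list). Load the full public suffix list in practice.
    private static final Set<String> SAMPLE_SUFFIXES =
            new HashSet<>(Arrays.asList("com", "uk", "co.uk", "ltd.uk"));

    // Lower-cases the host, drops one trailing dot, and splits it into labels.
    private static String[] labels(String host) {
        String h = host.toLowerCase(Locale.ROOT);
        if (h.endsWith(".")) {
            h = h.substring(0, h.length() - 1);
        }
        return h.split("\\.");
    }

    // Joins labels[from..end] back together with dots.
    private static String join(String[] labels, int from) {
        return String.join(".", Arrays.copyOfRange(labels, from, labels.length));
    }

    /** Longest sample suffix the host ends with, or null if none matches. */
    static String publicSuffix(String host) {
        String[] labels = labels(host);
        // Candidates run from longest to shortest, so "co.uk" wins over "uk".
        for (int i = 0; i < labels.length; i++) {
            String candidate = join(labels, i);
            if (SAMPLE_SUFFIXES.contains(candidate)) {
                return candidate;
            }
        }
        return null;
    }

    /** Public suffix plus one label to its left, or null if there is none. */
    static String registrableDomain(String host) {
        String[] labels = labels(host);
        for (int i = 0; i < labels.length; i++) {
            if (SAMPLE_SUFFIXES.contains(join(labels, i))) {
                // i == 0 means the host itself is a public suffix.
                return i == 0 ? null : join(labels, i - 1);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(publicSuffix("www.mydomain.ltd.uk"));      // ltd.uk
        System.out.println(registrableDomain("www.mydomain.ltd.uk")); // mydomain.ltd.uk
        System.out.println(publicSuffix("police.uk"));                // uk
    }
}
```

With this sample set, the hosts from the question give "uk" for police.uk, "co.uk" for bbc.co.uk and "com" for amazon.com. Trying the longest candidate first is what keeps "bbc.co.uk" from matching the shorter "uk" rule.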

ZZ Coder, answered Oct 26 '22 23:10


After looking at these answers and not being satisfied by them, I used the class com.google.common.net.InternetDomainName to subtract the public parts of a domain name from all of its parts:

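import com.google.common.net.InternetDomainName;
import java.util.HashSet;
import java.util.Set;

// Returns the labels of uriHost that are not part of its public suffix.
Set<String> nonePublicDomainParts(String uriHost) {
    InternetDomainName fullDomainName = InternetDomainName.from(uriHost);
    InternetDomainName publicDomainName = fullDomainName.publicSuffix(); // null if there is no public suffix
    Set<String> nonePublicParts = new HashSet<String>(fullDomainName.parts());
    if (publicDomainName != null) {
        nonePublicParts.removeAll(publicDomainName.parts());
    }
    return nonePublicParts;
}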

That class is available from Maven in the Guava library:

    <dependency>
        <groupId>com.google.guava</groupId>
        <artifactId>guava</artifactId>
        <version>10.0.1</version>
        <scope>compile</scope>
    </dependency>

Internally, this class uses TldPatterns, which is package-private and has the list of top-level domains baked into it.

Interestingly, if you look at that class's source at the link below, it explicitly lists "police.uk" as a private domain name. This is correct, as police.uk is a private domain controlled by the police; otherwise criminals.police.uk would be emailing you asking for your credit card details in relation to their ongoing investigations into card fraud ;)

http://code.google.com/p/guava-libraries/source/browse/guava/src/com/google/common/net/TldPatterns.java?spec=svn8c3cc7e67132f8dcaae4bd214736a8ddf6611769&r=8c3cc7e67132f8dcaae4bd214736a8ddf6611769
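One caveat with subtracting label sets: a Set drops label order, and removeAll removes every label that also appears in the suffix, so a host such as uk.example.co.uk would lose its leading "uk". The following is a small order-preserving sketch in plain Java, assuming the public suffix has already been determined by some other means (for example with InternetDomainName). PrivateLabelsSketch and privateLabels are hypothetical names, not part of Guava.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class PrivateLabelsSketch {

    /**
     * Returns the labels of host that sit to the left of publicSuffix,
     * in their original order. Both arguments are plain dotted names.
     */
    static List<String> privateLabels(String host, String publicSuffix) {
        List<String> hostLabels = Arrays.asList(host.split("\\."));
        List<String> suffixLabels = Arrays.asList(publicSuffix.split("\\."));

        // Index where the suffix would have to start inside the host.
        int cut = hostLabels.size() - suffixLabels.size();
        if (cut < 0 || !hostLabels.subList(cut, hostLabels.size()).equals(suffixLabels)) {
            throw new IllegalArgumentException(
                    host + " does not end with the suffix " + publicSuffix);
        }
        // Copy so the caller gets an independent, modifiable list.
        return new ArrayList<>(hostLabels.subList(0, cut));
    }

    public static void main(String[] args) {
        System.out.println(privateLabels("www.mydomain.ltd.uk", "ltd.uk")); // [www, mydomain]
        System.out.println(privateLabels("uk.example.co.uk", "co.uk"));     // [uk, example]
    }
}
```

Cutting by position rather than by value keeps duplicates intact, at the cost of requiring the suffix to be a true trailing run of the host's labels.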

simbo1905, answered Oct 26 '22 23:10