I am wondering if there is a parser or library in Java for extracting the second-level domain (SLD) from a URL, or failing that, an algorithm or regex for doing the same. For example:
URI uri = new URI("http://www.mydomain.ltd.uk/blah/some/page.html");
String host = uri.getHost();
System.out.println(host);
which prints:
www.mydomain.ltd.uk
Now what I'd like to do is robustly identify the SLD ("ltd.uk") component. Any ideas?
Edit: I'm ideally looking for a general solution, so I'd match ".uk" in "police.uk", ".co.uk" in "bbc.co.uk" and ".com" in "amazon.com".
Thanks
A second-level domain (SLD) is the part of the domain name located right before a top-level domain (TLD). For example, in mozilla.org the SLD is mozilla and the TLD is org.
In simple terms, a second-level domain is the name just to the left of the domain extension, such as .com or .net. The website example.com was reserved for explaining the relationship between top-level domains (TLDs) and second-level domains (SLDs).
A subdomain is a part of your domain name that sits below your second-level domain. For example, if you wanted to create a subdomain for your company blog, it would look like this: "blog.mysite.com".
After reading everything here, the correct solution should be (with Guava):
InternetDomainName.from(uriHost).topPrivateDomain().toString();
I don't know your purpose, but the second-level domain may not mean much to you. You probably need to find the public suffix; the domain right below it is what you are looking for.
Apache HttpComponents (HttpClient 4) comes with classes to handle this:
org.apache.http.impl.cookie.PublicSuffixFilter
org.apache.http.impl.cookie.PublicSuffixListParser
You need to download the public suffix list from here:
http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1
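To make the idea of "find the public suffix, then take the domain right below it" concrete, here is a minimal sketch using only the JDK. It is not how PublicSuffixFilter or PublicSuffixListParser work internally; the class PublicSuffixSketch and its methods are hypothetical names made up for illustration. It assumes the list text has one rule per line with "//" comments, and it deliberately skips wildcard ("*.") and exception ("!") rules, so treat it as a simplification rather than a full implementation of the list's matching rules. The four sample rules are a tiny hand-picked subset, not the real list.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration; not part of HttpClient or Guava.
class PublicSuffixSketch {

    // Parses text in the public suffix list layout: one rule per line,
    // lines starting with "//" are comments. Wildcard ("*") and
    // exception ("!") rules are skipped to keep the sketch short.
    static Set<String> parseRules(String listText) {
        Set<String> rules = new HashSet<>();
        for (String line : listText.split("\n")) {
            String rule = line.trim().toLowerCase(Locale.ROOT);
            if (rule.isEmpty() || rule.startsWith("//")
                    || rule.startsWith("*") || rule.startsWith("!")) {
                continue;
            }
            rules.add(rule);
        }
        return rules;
    }

    // Returns the longest trailing run of labels that is a known rule,
    // or null if no rule matches.
    static String publicSuffix(Set<String> rules, String host) {
        String[] labels = host.toLowerCase(Locale.ROOT).split("\\.");
        for (int i = 0; i < labels.length; i++) {
            String candidate = String.join(".",
                    Arrays.copyOfRange(labels, i, labels.length));
            if (rules.contains(candidate)) {
                return candidate;
            }
        }
        return null;
    }

    // Returns the public suffix plus the one label to its left, or null
    // if the host has no matching suffix or is itself a public suffix.
    static String registrableDomain(Set<String> rules, String host) {
        String suffix = publicSuffix(rules, host);
        if (suffix == null || suffix.length() == host.length()) {
            return null;
        }
        String lower = host.toLowerCase(Locale.ROOT);
        String prefix = lower.substring(0, lower.length() - suffix.length() - 1);
        return prefix.substring(prefix.lastIndexOf('.') + 1) + "." + suffix;
    }

    public static void main(String[] args) {
        // Sample rules only; a real program would read the downloaded list.
        Set<String> rules = parseRules("// sample\ncom\nuk\nco.uk\nltd.uk\n");
        System.out.println(publicSuffix(rules, "www.mydomain.ltd.uk"));      // ltd.uk
        System.out.println(registrableDomain(rules, "www.mydomain.ltd.uk")); // mydomain.ltd.uk
    }
}
```

With the same sample rules, "bbc.co.uk" yields "co.uk", "amazon.com" yields "com", and "police.uk" yields "uk", which matches the cases asked about in the question. Longest-match is used so that "co.uk" wins over "uk" when both are listed.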
After looking at these answers and not being satisfied by them, I used the class com.google.common.net.InternetDomainName to subtract the public parts of a domain name from all the parts:
import com.google.common.net.InternetDomainName;
import java.util.HashSet;
import java.util.Set;

// Returns the labels of uriHost that are not part of its public suffix.
Set<String> nonePublicDomainParts(String uriHost) {
    InternetDomainName fullDomainName = InternetDomainName.from(uriHost);
    InternetDomainName publicDomainName = fullDomainName.publicSuffix();
    Set<String> nonePublicParts = new HashSet<String>(fullDomainName.parts());
    nonePublicParts.removeAll(publicDomainName.parts());
    return nonePublicParts;
}
That class is available from Maven in the Guava library:
<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>10.0.1</version>
    <scope>compile</scope>
</dependency>
Internally, this class uses TldPatterns, which is package-private and has the list of top-level domains baked into it.
Interestingly, if you look at that class's source at the link below, it explicitly lists "police.uk" as a private domain name. This is correct, as police.uk is a private domain controlled by the police; otherwise criminals.police.uk would be emailing you asking for your credit card details in relation to their ongoing investigations into card fraud ;)
http://code.google.com/p/guava-libraries/source/browse/guava/src/com/google/common/net/TldPatterns.java?spec=svn8c3cc7e67132f8dcaae4bd214736a8ddf6611769&r=8c3cc7e67132f8dcaae4bd214736a8ddf6611769