I have a series of strings (URLs) in different forms, such as:
http://domain name.anything/anypath
https://domain name.anything/anypath
http://www.domain name.anything/anypath
https://www.domain name.anything/anypath
These strings are saved in a CSV file. I need to parse every URL in order to get the domain name only (domain name.anything), i.e., the part after the first "." and before the first "/".
I separated the strings using the split method, then converted each string to a URL, then used the getAuthority() method to get the domain name only. The problem is that getAuthority() and getHost() do the same job for me: both include the "www." that I don't want. However, the Oracle tutorial seems to suggest that getAuthority() is supposed to return the domain name without the "www.".
How can I extract the domain name part only, without the "www." of the URL?
The URL class provides several methods that let you query URL objects. You can get the protocol, authority, host name, port number, path, query, filename, and reference from a URL using its accessor methods. For example, getProtocol() returns the protocol identifier component of the URL.
The getHost() method is part of the URL class. It returns the host name of the specified URL, without the protocol, port, or path.
What is the difference between the getHost and getAuthority methods in the URL class?
To really understand this, you should read the URI specification, RFC 2396.
The short answer is that the authority component consists of the host component together with an optional port number, username and password ... depending on the URL scheme that is used.
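To make the distinction concrete, here is a minimal sketch that prints both values for two URLs. The class name HostVsAuthority, the helper hostAndAuthority, and the example.com URLs are hypothetical, chosen only for illustration; the URL is built through URI.toURL() simply to avoid the deprecated URL constructor in recent Java versions.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

class HostVsAuthority {
    // Hypothetical helper: returns {host, authority} for an absolute URL string.
    static String[] hostAndAuthority(String spec) {
        try {
            URL url = new URI(spec).toURL();
            return new String[] { url.getHost(), url.getAuthority() };
        } catch (URISyntaxException | MalformedURLException | IllegalArgumentException e) {
            // Wrap everything in one unchecked exception to keep the sketch short.
            throw new IllegalArgumentException("Not a usable URL: " + spec, e);
        }
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://www.example.com/anypath",                  // no port, no user info
            "http://user:secret@www.example.com:8080/anypath"  // user info and port present
        };
        for (String spec : samples) {
            String[] parts = hostAndAuthority(spec);
            System.out.println(spec + " -> host=" + parts[0] + ", authority=" + parts[1]);
        }
    }
}
```

For the first URL the two values are identical, which matches what the question observed. They differ only when the URL carries user info or an explicit port, and neither method strips "www.".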
How can I extract the domain name part only, without the "www." of the URL?
You call getHost(), test if it starts with the string "www." and, if it does, remove it.
But before you start doing things like that, you need to understand that removing the "www." may give you a URL that doesn't work, or that resolves to a document or service different from the one the original URL resolves to. It is a bad idea to gratuitously tidy up URLs ... unless you have detailed knowledge of how the sites in question are organized.
The convention that "foo.com" and "www.foo.com" are the same place is just a convention, and a lot of sites don't implement it. Removing "www." would be a bad idea because it is liable to turn resolvable URLs into URLs that don't resolve.
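If the result is only used as a label (for example, to group the CSV rows by site) rather than to build new URLs, the getHost() check described above could be sketched roughly as follows. The class name DomainExtractor, the method hostWithoutWww, and the sample CSV line are hypothetical; lower-casing the host and splitting on a plain comma are assumptions made for this sketch, not something the question specifies.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.util.Locale;

class DomainExtractor {
    private static final String WWW_PREFIX = "www.";

    // Hypothetical helper: returns the host of the URL with a leading "www." removed.
    static String hostWithoutWww(String spec) {
        final URL url;
        try {
            // URI.create throws IllegalArgumentException on a syntax error.
            url = URI.create(spec.trim()).toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Not a usable URL: " + spec, e);
        }
        // Host names are case-insensitive, so normalise before testing the prefix (assumption).
        String host = url.getHost().toLowerCase(Locale.ROOT);
        return host.startsWith(WWW_PREFIX) ? host.substring(WWW_PREFIX.length()) : host;
    }

    public static void main(String[] args) {
        // Assumed CSV layout: one line of comma-separated URLs with no quoted fields.
        String csvLine = "http://example.com/anypath,https://www.example.com/anypath";
        for (String field : csvLine.split(",")) {
            System.out.println(hostWithoutWww(field));
        }
    }
}
```

Both fields in the sample line yield example.com. Only an exact leading "www." label is removed; other subdomains such as "mail." are left alone.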
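You can use Google Guava to get the domain name from the host name:
InternetDomainName.from(hostname).topPrivateDomain().toString()
Here topPrivateDomain() returns the part of the host name that sits one level below the public suffix, so www.example.co.uk becomes example.co.uk.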