
What is the difference between getHost and getAuthority methods in URL class in Java?

I have a series of strings (URLs) in different forms, such as:

  1. http://domain name.anything/anypath
  2. https://domain name.anything/anypath
  3. http://www.domain name.anything/anypath
  4. https://www.domain name.anything/anypath

These strings are saved in a CSV file. I need to parse every URL to get only the domain name, domain name.anything, that is, the host with any leading www. removed, up to the first / of the path.

I separated the strings using the split method, converted each string to a URL, and then used getAuthority to get the domain name only. The problem is that getAuthority and getHost do the same job for me: both include the www. that I don't want. Yet from the Oracle tutorial, it seems that getAuthority is supposed to return the domain name without the www. prefix.

How can I extract only the domain name part of the URL, without the www.?

Jury A asked Jun 26 '12 14:06





2 Answers

What is the difference between the getHost and getAuthority methods in the URL class?

To really understand this, you should read the URI specification, RFC 2396.

The short answer is that the authority component consists of the host component together with an optional port number, username and password ... depending on the URL scheme that is used.
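As a rough illustration of that difference, the sketch below parses two made-up URLs (the host www.example.com, the user info and the port are assumptions, not taken from the question) and prints both values. It suggests why the two methods look identical for the URLs in the question: with no user info and no port, the authority is just the host.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

public class HostVsAuthority {

    /** Parses the string into a java.net.URL, rethrowing failures unchecked. */
    public static URL parse(String spec) {
        try {
            return URI.create(spec).toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Not a valid URL: " + spec, e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical URL carrying user info and an explicit port.
        URL full = parse("http://user:secret@www.example.com:8080/anypath");
        System.out.println(full.getAuthority()); // user:secret@www.example.com:8080
        System.out.println(full.getHost());      // www.example.com

        // With no user info and no port, both methods return the same string.
        URL plain = parse("http://www.example.com/anypath");
        System.out.println(plain.getAuthority()); // www.example.com
        System.out.println(plain.getHost());      // www.example.com
    }
}
```

Neither method strips "www."; that prefix is simply part of the host name.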


How can I extract the domain name part only without the "www." of the URL?

You call getHost(), test if it starts with the string "www." and if it does you remove it.
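A minimal sketch of that approach might look like the following. The class and method names (HostWithoutWww, hostWithoutWww) and the example.com URLs are hypothetical, and the case-insensitive prefix check is an assumption about what is wanted rather than something the question specifies.

```java
import java.net.MalformedURLException;
import java.net.URI;

public class HostWithoutWww {

    private static final String PREFIX = "www.";

    /** Returns the host of the given URL string, minus one leading "www." if present. */
    public static String hostWithoutWww(String spec) {
        String host;
        try {
            host = URI.create(spec).toURL().getHost();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Not a valid URL: " + spec, e);
        }
        // Compare the first four characters against "www.", ignoring case.
        boolean hasPrefix = host.regionMatches(true, 0, PREFIX, 0, PREFIX.length());
        return hasPrefix ? host.substring(PREFIX.length()) : host;
    }

    public static void main(String[] args) {
        System.out.println(hostWithoutWww("https://www.example.com/anypath")); // example.com
        System.out.println(hostWithoutWww("http://example.com/anypath"));      // example.com
    }
}
```

Note that URI.create rejects strings containing spaces, so the input is assumed to be a well-formed URL.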

But before you start doing things like that, you need to understand that removing the "www." may give you a URL that doesn't work, or that resolves to a document or service that is different to the one the original URL resolves to. It is a bad idea to gratuitously tidy up URLs ... unless you have detailed knowledge of how the sites in question are organized.

The convention that "foo.com" and "www.foo.com" are the same place is just a convention, and a lot of sites don't implement it. Removing "www." would be a bad idea because it is liable to turn resolvable URLs into URLs that don't resolve.

Stephen C answered Oct 03 '22 06:10



You can use Google Guava to get the domain name from the host name:

InternetDomainName.from(hostname).topPrivateDomain().toString()
Martin Charlesworth answered Oct 03 '22 07:10
