What is the difference between the getHost and getAuthority methods of the URL class in Java?

Tags:

java

networking

I have a series of strings (URLs) in different forms:

http://domain name.anything/anypath
https://domain name.anything/anypath
http://www.domain name.anything/anypath
https://www.domain name.anything/anypath

These strings are saved in a CSV file. I need to parse every URL to get the domain name only, domain name.anything, i.e. the part after the scheme (and after the leading www., if present) and before the first /.

I separated the strings using the split method, converted each string to a URL, and then called getAuthority to get the domain name only. The problem is that getAuthority and getHost do the same thing for me: both include the www. prefix that I don't want. Yet from the Oracle tutorial, it seems that getAuthority is supposed to return the domain name without the www. prefix.

How can I extract only the domain name part of the URL, without the www. prefix?


asked Jun 26 '12 14:06

Jury A

2 Answers

What is the difference between the getHost and getAuthority methods in the URL class?

To really understand this, you should read the URI specification, RFC 2396 (since obsoleted by RFC 3986).

The short answer is that the authority component consists of the host component together with an optional port number, username and password ... depending on the URL scheme that is used.
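To make the difference concrete, here is a minimal sketch. The class name HostVsAuthority, the helper parse, and the example.com URLs (including the user info and port) are made up for illustration. It only shows that the two methods agree when the URL has no user info or port, and differ when it does.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

class HostVsAuthority {
    /** Parses the string into a java.net.URL, rethrowing the checked exception as unchecked. */
    static URL parse(String spec) {
        try {
            return URI.create(spec).toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Not a valid URL: " + spec, e);
        }
    }

    public static void main(String[] args) {
        // No user info and no port: host and authority are the same string.
        URL plain = parse("http://www.example.com/anypath");
        System.out.println(plain.getHost());       // www.example.com
        System.out.println(plain.getAuthority());  // www.example.com

        // With user info and a port, only the authority includes them.
        URL full = parse("http://user:secret@www.example.com:8080/anypath");
        System.out.println(full.getHost());        // www.example.com
        System.out.println(full.getAuthority());   // user:secret@www.example.com:8080
    }
}
```

Neither method treats "www." specially: it is just part of the host name, which is why both return it.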

How can I extract the domain name part only without the "www." of the URL ??

You call getHost(), test if it starts with the string "www." and if it does you remove it.
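A minimal sketch of that approach follows. StripWww and hostWithoutWww are hypothetical names, and the example.com URLs are placeholders; it assumes each input is a syntactically valid URL (no spaces in the host).

```java
import java.net.URI;

class StripWww {
    /**
     * Returns the host of the given URL with one leading "www." removed,
     * or null if the URL has no host component.
     */
    static String hostWithoutWww(String url) {
        // URI.create throws IllegalArgumentException for malformed input.
        String host = URI.create(url).getHost();
        if (host == null) {
            return null;
        }
        // Drop the prefix only when it is actually there.
        return host.startsWith("www.") ? host.substring("www.".length()) : host;
    }

    public static void main(String[] args) {
        System.out.println(hostWithoutWww("http://example.com/anypath"));      // example.com
        System.out.println(hostWithoutWww("https://www.example.com/anypath")); // example.com
    }
}
```

Other subdomains are left alone, so a host such as mail.example.com is returned unchanged.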

But before you start doing things like that, you need to understand that removing the "www." may give you a URL that doesn't work, or that resolves to a document or service that is different to the one the original URL resolves to. It is a bad idea to gratuitously tidy up URLs ... unless you have detailed knowledge of how the sites in question are organized.

That "foo.com" and "www.foo.com" are the same place is just a convention, and a lot of sites don't implement it. Removing "www." is therefore liable to turn resolvable URLs into URLs that don't resolve.

answered Oct 03 '22 06:10

Stephen C

You can use Google Guava to get the domain name from the host name:

InternetDomainName.from(hostname).topPrivateDomain().toString()

answered Oct 03 '22 07:10

Martin Charlesworth
