
Standard URL Normalization - Java

Is there a Java package or library that implements standard URL normalization?

5 Components of URL Representation

http://www.example.com:8040/folder/exist?name=sky#head

  1. scheme: http
  2. authority: www.example.com:8040
  3. path: /folder/exist
  4. query: name=sky (introduced by the ‘?’ delimiter)
  5. fragment: head (introduced by the ‘#’ delimiter)
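
As a rough illustration of this breakdown, here is a minimal sketch that splits the example URL with the JDK's java.net.URI. The class name UrlComponents and the helper split are hypothetical names chosen for this example. Note that java.net.URI reports the query and fragment without their leading ‘?’ and ‘#’ delimiters.

```java
import java.net.URI;

final class UrlComponents {
    /** Returns {scheme, authority, path, query, fragment}; absent parts are null. */
    static String[] split(String url) {
        URI uri = URI.create(url); // throws IllegalArgumentException on malformed input
        return new String[] {
            uri.getScheme(),
            uri.getRawAuthority(),
            uri.getRawPath(),
            uri.getRawQuery(),
            uri.getRawFragment()
        };
    }

    public static void main(String[] args) {
        String[] names = {"scheme", "authority", "path", "query", "fragment"};
        String[] parts = split("http://www.example.com:8040/folder/exist?name=sky#head");
        for (int i = 0; i < names.length; i++) {
            System.out.println(names[i] + ": " + parts[i]);
        }
    }
}
```

The raw getters are used so that percent-encoded octets come back unchanged, which matters if the parts are later normalized and reassembled.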

The 3 types of standard URL normalization

Syntax-Based Normalization

  • Case normalization – convert all letters in the scheme and authority components to lower case
  • Percent-encoding normalization – decode any percent-encoded octet that corresponds to an unreserved character, such as %2D for hyphen and %5F for underscore
  • Path segment normalization – remove dot-segments, such as ‘.’ and ‘..’, from the path component
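
The three syntax-based steps could be sketched with the JDK alone, roughly as follows. This is not a complete normalizer: the class SyntaxNormalizer and its methods are hypothetical names, the sketch assumes hierarchical URLs such as http ones (not opaque URIs like mailto:), it lowercases only the host and port part of the authority so that any user-info keeps its case, and it additionally uppercases the hex digits of escapes it leaves in place, which RFC 3986 recommends but the list above does not mention.

```java
import java.net.URI;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SyntaxNormalizer {
    private static final Pattern ESCAPE = Pattern.compile("%([0-9A-Fa-f]{2})");

    /** Decodes escapes of unreserved characters (A-Z a-z 0-9 - . _ ~); uppercases the rest. */
    static String decodeUnreserved(String s) {
        Matcher m = ESCAPE.matcher(s);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char c = (char) Integer.parseInt(m.group(1), 16);
            boolean unreserved = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
                    || (c >= '0' && c <= '9') || c == '-' || c == '.' || c == '_' || c == '~';
            String replacement = unreserved ? String.valueOf(c) : m.group().toUpperCase(Locale.ROOT);
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    /** Assumption: the input is a hierarchical URL (scheme://authority/path...). */
    static String normalize(String url) {
        URI u = URI.create(url);
        StringBuilder sb = new StringBuilder();
        // Case normalization: scheme and host are case-insensitive.
        if (u.getScheme() != null) {
            sb.append(u.getScheme().toLowerCase(Locale.ROOT)).append(':');
        }
        String authority = u.getRawAuthority();
        if (authority != null) {
            int at = authority.lastIndexOf('@'); // keep "user:pass@" untouched
            sb.append("//").append(authority, 0, at + 1)
              .append(authority.substring(at + 1).toLowerCase(Locale.ROOT));
        }
        // Percent-encoding normalization on path, query, and fragment.
        if (u.getRawPath() != null) sb.append(decodeUnreserved(u.getRawPath()));
        if (u.getRawQuery() != null) sb.append('?').append(decodeUnreserved(u.getRawQuery()));
        if (u.getRawFragment() != null) sb.append('#').append(decodeUnreserved(u.getRawFragment()));
        // Path segment normalization: URI.normalize() removes "." and ".." segments.
        return URI.create(sb.toString()).normalize().toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("HTTP://WWW.Example.COM/a/./b/../c%2Dd%5Fe?x=%7Ey"));
    }
}
```

One caveat: URI.normalize() keeps a leading ‘..’ segment that has nothing to cancel it, so its output can differ from the RFC 3986 dot-segment algorithm for paths such as /../a.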

Scheme-Based Normalization

  • Add a trailing ‘/’ after the authority component when the path is empty
  • Remove the default port number, such as 80 for the http scheme
  • Remove the fragment from the URL
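
The scheme-based rules above might look like this in plain Java. It is only a sketch under stated assumptions: SchemeNormalizer is a hypothetical name, only http and https are handled (the https default of 443 is added here and is not part of the list above), and the input is assumed to be an absolute URL with a host.

```java
import java.net.URI;

final class SchemeNormalizer {
    /** Assumption: the input is an absolute http or https URL with a host. */
    static String normalize(String url) {
        URI u = URI.create(url);
        String scheme = u.getScheme();
        if (scheme == null || u.getHost() == null) {
            throw new IllegalArgumentException("absolute URL with a host expected: " + url);
        }

        // Remove the default port for the scheme.
        int port = u.getPort();
        boolean defaultPort = ("http".equalsIgnoreCase(scheme) && port == 80)
                || ("https".equalsIgnoreCase(scheme) && port == 443);
        if (defaultPort) {
            port = -1;
        }

        // Add a trailing '/' when the path is empty.
        String path = u.getRawPath();
        if (path == null || path.isEmpty()) {
            path = "/";
        }

        // Reassemble from raw parts so existing percent-escapes are not re-encoded.
        StringBuilder sb = new StringBuilder();
        sb.append(scheme).append("://");
        if (u.getRawUserInfo() != null) sb.append(u.getRawUserInfo()).append('@');
        sb.append(u.getHost());
        if (port != -1) sb.append(':').append(port);
        sb.append(path);
        if (u.getRawQuery() != null) sb.append('?').append(u.getRawQuery());
        // The fragment is deliberately left out.
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("http://www.example.com:80"));
        System.out.println(normalize("http://www.example.com:8040/folder/exist?name=sky#head"));
    }
}
```

The URL is rebuilt as a string rather than with the multi-argument URI constructors because those constructors always quote the percent character, which would double-encode existing escapes.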

Protocol-Based Normalization

  • Only appropriate when the results of accessing the resources are equivalent
  • For example, example.com/data is redirected to example.com/data/ by the origin server
lockone asked Jul 29 '10 17:07


1 Answer

As others have mentioned, java.net.URL and/or java.net.URI are some obvious starting points.

Here are some other options:

  1. Galimatias (Spanish for "gibberish") appears to be an opinionated and relatively popular URL normalization library for Java. The source code can be found at github.com/smola/galimatias.

    galimatias started out of frustration with java.net.URL and java.net.URI. Both of them are good for basic use cases, but severely broken for others.

  2. The github.com/sentric/url-normalization library provides another (unusual, in my opinion) approach where it reverses the domain portion; e.g. "com.stackoverflow" instead of "stackoverflow.com".

You can find other variations on GitHub, some of them implemented in other languages such as Python, Ruby, and PHP.

David J. answered Oct 09 '22 19:10