
Normalising possibly encoded URI strings in Java

Using Java, I want to strip the fragment identifier and do some simple normalisation (e.g., lowercase schemes, hosts) of a diverse set of URIs. The input and output URIs should be equivalent in a general HTTP sense.

Typically, this should be straightforward. However, for URIs like http://blah.org/A_%28Secret%29.xml#blah, which percent-encodes (Secret), the behaviour of java.net.URI makes life difficult.

The normalisation method should return http://blah.org/A_%28Secret%29.xml, since the URIs http://blah.org/A_%28Secret%29.xml and http://blah.org/A_(Secret).xml are not equivalent in interpretation [§2.2; RFC3986].

So we have the following two candidate normalisation methods:

URI u = new URI("http://blah.org/A_%28Secret%29.xml#blah");
System.out.println(u);
        // prints "http://blah.org/A_%28Secret%29.xml#blah"

String path1 = u.getPath();      // gives "/A_(Secret).xml" (escapes decoded)
String path2 = u.getRawPath();   // gives "/A_%28Secret%29.xml" (escapes kept)


//NORMALISE METHOD 1
URI norm1 = new URI(u.getScheme().toLowerCase(), u.getUserInfo(), 
                      u.getHost().toLowerCase(), u.getPort(), path1, 
                      u.getQuery(), null);
System.out.println(norm1);
// prints "http://blah.org/A_(Secret).xml"

//NORMALISE METHOD 2
URI norm2 = new URI(u.getScheme().toLowerCase(), u.getUserInfo(),
                      u.getHost().toLowerCase(), u.getPort(), path2, 
                      u.getQuery(), null);
System.out.println(norm2);
// prints "http://blah.org/A_%2528Secret%2529.xml"

As we see, the URI is parsed and rebuilt without the fragment identifier.

However, for method 1, u.getPath() returns the decoded path, so the rebuilt URI contains literal parentheses and is no longer the URI we started with.

For method 2, u.getRawPath() returns the original path, but when it is passed to the multi-argument URI constructor, the constructor quotes each '%' again, so %28 becomes %2528 (double encoding).

This feels like a Chinese finger trap.

So two main questions:

  • Why does java.net.URI feel the need to play with encoding?
  • How can this normalise method be implemented without fiddling with the original percent encoding?

(I would rather not have to reimplement the parse/concatenate methods of java.net.URI, which are non-trivial.)


EDIT: Here's some further info from the java.net.URI javadoc.

  • The single-argument constructor requires any illegal characters in its argument to be quoted and preserves any escaped octets and other characters that are present.

  • The multi-argument constructors quote illegal characters as required by the components in which they appear. The percent character ('%') is always quoted by these constructors. Any other characters are preserved.

  • The getRawUserInfo, getRawPath, getRawQuery, getRawFragment, getRawAuthority, and getRawSchemeSpecificPart methods return the values of their corresponding components in raw form, without interpreting any escaped octets. The strings returned by these methods may contain both escaped octets and other characters, and will not contain any illegal characters.

  • The getUserInfo, getPath, getQuery, getFragment, getAuthority, and getSchemeSpecificPart methods decode any escaped octets in their corresponding components. The strings returned by these methods may contain both other characters and illegal characters, and will not contain any escaped octets.

  • The toString method returns a URI string with all necessary quotation but which may contain other characters.

  • The toASCIIString method returns a fully quoted and encoded URI string that does not contain any other characters.

So I cannot use the multi-argument constructor without having the URL encoding messed with internally by the URI class. Pah!
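The javadoc rules quoted above hint at one possible workaround: read the components with the getRaw* methods, join them by hand, and parse the result with the single-argument form, which preserves escaped octets. Below is a minimal sketch of that idea, not a complete normaliser. The class and method names (RawNormalise, normalise) are hypothetical, and it assumes URI.create, which the javadoc describes as behaving like the URI(String) constructor. Edge cases such as an empty authority (file:///...) are not specially handled.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical helper for illustration; not part of the JDK.
final class RawNormalise {

    // Drops the fragment and lowercases scheme and host, leaving all
    // percent-escapes exactly as they appear in the input URI.
    static URI normalise(URI u) {
        StringBuilder sb = new StringBuilder();
        if (u.getScheme() != null) {
            sb.append(u.getScheme().toLowerCase(Locale.ROOT)).append(':');
        }
        if (u.isOpaque()) {
            // Opaque URIs (e.g. mailto:) have no path; keep the raw remainder.
            return URI.create(sb.append(u.getRawSchemeSpecificPart()).toString());
        }
        if (u.getHost() != null) {
            // Server-based authority: rebuild it so the host can be lowercased.
            sb.append("//");
            if (u.getRawUserInfo() != null) {
                sb.append(u.getRawUserInfo()).append('@');
            }
            sb.append(u.getHost().toLowerCase(Locale.ROOT));
            if (u.getPort() != -1) {
                sb.append(':').append(u.getPort());
            }
        } else if (u.getRawAuthority() != null) {
            // Registry-based authority: copy it through unchanged.
            sb.append("//").append(u.getRawAuthority());
        }
        if (u.getRawPath() != null) {
            sb.append(u.getRawPath());
        }
        if (u.getRawQuery() != null) {
            sb.append('?').append(u.getRawQuery());
        }
        // Single-argument parsing does not re-quote '%'.
        return URI.create(sb.toString());
    }

    public static void main(String[] args) {
        URI u = URI.create("HTTP://Blah.ORG/A_%28Secret%29.xml#blah");
        System.out.println(normalise(u));
        // prints "http://blah.org/A_%28Secret%29.xml"
    }
}
```

Because the string is only ever parsed, never passed through a multi-argument constructor, neither the decoding of method 1 nor the double encoding of method 2 occurs.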

badroit asked Feb 23 '12 19:02



2 Answers

java.net.URI was introduced in Java 1.4 (released in 2002) and is based on RFC2396, which treats '(' and ')' as unreserved characters: they do not need to be escaped, and escaping them does not change the meaning of the URI. The RFC even says they should not be escaped unless it is necessary (§2.3, RFC2396).

RFC3986 (published in 2005) changed this, and I guess the JDK developers decided not to change the behaviour of java.net.URI, to stay compatible with existing code.

After some googling, I found the Jena IRI library, which looks like a good fit because it keeps the two forms distinct:

import java.util.ArrayList;
// Package name in current Apache Jena releases; older releases used com.hp.hpl.jena.iri
import org.apache.jena.iri.IRI;
import org.apache.jena.iri.IRIFactory;

public class IRITest {
    public static void main(String[] args) {
        IRIFactory factory = IRIFactory.uriImplementation();

        // Percent-encoded parentheses: the raw getters leave the escapes alone
        IRI iri = factory.construct("http://blah.org/A_%28Secret%29.xml#blah");
        ArrayList<String> a = new ArrayList<String>();
        a.add(iri.getScheme());
        a.add(iri.getRawUserinfo());
        a.add(iri.getRawHost());
        a.add(iri.getRawPath());
        a.add(iri.getRawQuery());
        a.add(iri.getRawFragment());

        // Literal parentheses: no escapes are introduced either
        IRI iri2 = factory.construct("http://blah.org/A_(Secret).xml#blah");
        ArrayList<String> b = new ArrayList<String>();
        b.add(iri2.getScheme());
        b.add(iri2.getRawUserinfo());
        b.add(iri2.getRawHost());
        b.add(iri2.getRawPath());
        b.add(iri2.getRawQuery());
        b.add(iri2.getRawFragment());

        System.out.println(a);
        //[http, null, blah.org, /A_%28Secret%29.xml, null, blah]
        System.out.println(b);
        //[http, null, blah.org, /A_(Secret).xml, null, blah]
    }
}
Chikei answered Oct 18 '22 03:10



Note this passage at the end of [§2.2; RFC3986]:

URI producing applications should percent-encode data octets that correspond to characters in the reserved set unless these characters are specifically allowed by the URI scheme to represent data in that component. If a reserved character is found in a URI component and no delimiting role is known for that character, then it must be interpreted as representing the data octet corresponding to that character's encoding in US-ASCII.

So, as long as the scheme is http or https, the encoding is the correct behavior.

Try using the toASCIIString method instead of toString for printing the URI. E.g.:

System.out.println(norm1.toASCIIString());
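Whether this changes the output for the URI in the question is worth checking. According to the javadoc quoted in the question, toASCIIString differs from toString only for characters in the "other" (non-US-ASCII) category, so parentheses may well come out unchanged. A small check, with a hypothetical class name (AsciiStringDemo) and a made-up accented URI used purely for contrast:

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical demo class; compares toString() with toASCIIString().
final class AsciiStringDemo {

    // Convenience wrapper so the behaviour can be checked on any URI string.
    static String asciiForm(String uri) {
        return URI.create(uri).toASCIIString();
    }

    public static void main(String[] args) throws URISyntaxException {
        // Same construction as "method 1" in the question.
        URI norm1 = new URI("http", null, "blah.org", -1,
                            "/A_(Secret).xml", null, null);
        System.out.println(norm1.toString());
        // prints "http://blah.org/A_(Secret).xml"
        System.out.println(norm1.toASCIIString());
        // prints "http://blah.org/A_(Secret).xml"

        // A non-ASCII character is where the two methods differ.
        System.out.println(asciiForm("http://blah.org/caf\u00e9"));
        // prints "http://blah.org/caf%C3%A9"
    }
}
```

In this check the parentheses are not re-encoded, so toASCIIString on its own does not restore the original %28 and %29.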
Devon_C_Miller answered Oct 18 '22 04:10
