Using Java, I want to strip the fragment identifier and do some simple normalisation (e.g., lowercase schemes, hosts) of a diverse set of URIs. The input and output URIs should be equivalent in a general HTTP sense.
Typically, this should be straightforward. However, for URIs like http://blah.org/A_%28Secret%29.xml#blah, which percent-encodes (Secret), the behaviour of java.net.URI makes life difficult.
The normalisation method should return http://blah.org/A_%28Secret%29.xml, since the URIs http://blah.org/A_%28Secret%29.xml and http://blah.org/A_(Secret).xml are not equivalent in interpretation [§2.2; RFC3986].
So we have the two following normalisation methods:
URI u = new URI("http://blah.org/A_%28Secret%29.xml#blah");
System.out.println(u);
// prints "http://blah.org/A_%28Secret%29.xml#blah"
String path1 = u.getPath(); // decoded path: "/A_(Secret).xml"
String path2 = u.getRawPath(); // raw path: "/A_%28Secret%29.xml"
//NORMALISE METHOD 1
URI norm1 = new URI(u.getScheme().toLowerCase(), u.getUserInfo(),
u.getHost().toLowerCase(), u.getPort(), path1,
u.getQuery(), null);
System.out.println(norm1);
// prints "http://blah.org/A_(Secret).xml"
//NORMALISE METHOD 2
URI norm2 = new URI(u.getScheme().toLowerCase(), u.getUserInfo(),
u.getHost().toLowerCase(), u.getPort(), path2,
u.getQuery(), null);
System.out.println(norm2);
// prints "http://blah.org/A_%2528Secret%2529.xml"
As we see, the URI is parsed and rebuilt without the fragment identifier.
However, for method 1, u.getPath() returns the decoded path, which changes the final URI.
For method 2, u.getRawPath() returns the original path, but when it is passed to the multi-argument URI constructor, Java quotes the '%' characters again and the result is double-encoded.
This feels like a Chinese finger trap.
So two main questions:
Why does java.net.URI feel the need to play with encoding? (I would rather not have to implement the parse/concatenate methods of java.net.URI, which are non-trivial.)
EDIT: Here's some further info from the URI javadoc.
The single-argument constructor requires any illegal characters in its argument to be quoted and preserves any escaped octets and other characters that are present.
The multi-argument constructors quote illegal characters as required by the components in which they appear. The percent character ('%') is always quoted by these constructors. Any other characters are preserved.
The getRawUserInfo, getRawPath, getRawQuery, getRawFragment, getRawAuthority, and getRawSchemeSpecificPart methods return the values of their corresponding components in raw form, without interpreting any escaped octets. The strings returned by these methods may contain both escaped octets and other characters, and will not contain any illegal characters.
The getUserInfo, getPath, getQuery, getFragment, getAuthority, and getSchemeSpecificPart methods decode any escaped octets in their corresponding components. The strings returned by these methods may contain both other characters and illegal characters, and will not contain any escaped octets.
The toString method returns a URI string with all necessary quotation but which may contain other characters.
The toASCIIString method returns a fully quoted and encoded URI string that does not contain any other characters.
So I cannot use the multi-argument constructor without having the URL encoding messed with internally by the URI class. Pah!
This is because java.net.URI was introduced in Java 1.4 (released in 2002) and is based on RFC2396, which treats '(' and ')' as characters that do not need escaping and whose meaning does not change if they are escaped. It even says they should not be escaped unless necessary (§2.3, RFC2396).
RFC3986 (published in 2005) changed this, and I guess the JDK developers decided not to change the behavior of java.net.URI, for compatibility with existing code.
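After some googling, I found Jena IRI, which looks good:
// Package names below are for Apache Jena releases; older releases used com.hp.hpl.jena.iri
import java.util.ArrayList;
import org.apache.jena.iri.IRI;
import org.apache.jena.iri.IRIFactory;

public class IRITest {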
public static void main(String[] args) {
IRIFactory factory = IRIFactory.uriImplementation();
IRI iri = factory.construct("http://blah.org/A_%28Secret%29.xml#blah");
ArrayList<String> a = new ArrayList<String>();
a.add(iri.getScheme());
a.add(iri.getRawUserinfo());
a.add(iri.getRawHost());
a.add(iri.getRawPath());
a.add(iri.getRawQuery());
a.add(iri.getRawFragment());
IRI iri2 = factory.construct("http://blah.org/A_(Secret).xml#blah");
ArrayList<String> b = new ArrayList<String>();
b.add(iri2.getScheme());
b.add(iri2.getRawUserinfo());
b.add(iri2.getRawHost());
b.add(iri2.getRawPath());
b.add(iri2.getRawQuery());
b.add(iri2.getRawFragment());
System.out.println(a);
//[http, null, blah.org, /A_%28Secret%29.xml, null, blah]
System.out.println(b);
//[http, null, blah.org, /A_(Secret).xml, null, blah]
}
}
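If pulling in Jena is not an option, one possible workaround stays within java.net.URI: per the javadoc quoted in the question, the single-argument constructor preserves escaped octets, so the raw components can be concatenated by hand and re-parsed. The following is only a sketch under that assumption. The class name StripFragment and the method normalise are made up for illustration, and only scheme and host lowercasing plus fragment removal are handled.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

class StripFragment {
    // Hypothetical helper: lowercases scheme and host, drops the fragment,
    // and leaves every percent-escape exactly as it was in the input.
    static URI normalise(URI u) throws URISyntaxException {
        StringBuilder sb = new StringBuilder();
        if (u.getScheme() != null) {
            sb.append(u.getScheme().toLowerCase(Locale.ROOT)).append(':');
        }
        if (u.isOpaque()) {
            // e.g. mailto:someone@example.org has no path or authority to rebuild
            sb.append(u.getRawSchemeSpecificPart());
            return new URI(sb.toString());
        }
        if (u.getHost() != null) {
            sb.append("//");
            if (u.getRawUserInfo() != null) {
                sb.append(u.getRawUserInfo()).append('@');
            }
            sb.append(u.getHost().toLowerCase(Locale.ROOT));
            if (u.getPort() != -1) {
                sb.append(':').append(u.getPort());
            }
        } else if (u.getRawAuthority() != null) {
            // Registry-based authority: keep it untouched
            sb.append("//").append(u.getRawAuthority());
        }
        if (u.getRawPath() != null) {
            sb.append(u.getRawPath());
        }
        if (u.getRawQuery() != null) {
            sb.append('?').append(u.getRawQuery());
        }
        // Single-argument constructor: escaped octets are preserved, not re-quoted
        return new URI(sb.toString());
    }

    public static void main(String[] args) throws URISyntaxException {
        URI u = new URI("HTTP://Blah.ORG/A_%28Secret%29.xml#blah");
        System.out.println(normalise(u));
        // prints "http://blah.org/A_%28Secret%29.xml"
    }
}
```

The trade-off is that the concatenation logic is hand-written, so it is worth testing against whatever unusual URIs appear in the real input set.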
Note this passage at the end of [§2.2; RFC3986]:
URI producing applications should percent-encode data octets that correspond to characters in the reserved set unless these characters are specifically allowed by the URI scheme to represent data in that component. If a reserved character is found in a URI component and no delimiting role is known for that character, then it must be interpreted as representing the data octet corresponding to that character's encoding in US-ASCII.
So, as long as the scheme is http or https, the encoding is the correct behavior.
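To make the distinction concrete, here is a rough sketch of the escape handling that RFC3986 §6.2.2 describes as safe for normalisation: escapes of unreserved characters (letters, digits, '-', '.', '_', '~') are decoded, and every other escape is kept, with its hex digits uppercased. The class and method names are hypothetical, and the input is assumed to be a raw (still-encoded) ASCII component such as the result of getRawPath().

```java
import java.util.Locale;

class PercentNormaliser {
    // Unreserved set from RFC 3986 section 2.3
    static boolean isUnreserved(int c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
                || (c >= '0' && c <= '9')
                || c == '-' || c == '.' || c == '_' || c == '~';
    }

    // Value of an ASCII hex digit, or -1 if the character is not one
    static int hex(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    // Hypothetical helper: decodes only escapes of unreserved characters,
    // uppercases the hex digits of all other escapes, copies everything else.
    static String normaliseEscapes(String raw) {
        StringBuilder out = new StringBuilder(raw.length());
        for (int i = 0; i < raw.length(); i++) {
            char ch = raw.charAt(i);
            int hi = (ch == '%' && i + 2 < raw.length()) ? hex(raw.charAt(i + 1)) : -1;
            int lo = (hi >= 0) ? hex(raw.charAt(i + 2)) : -1;
            if (lo < 0) {
                out.append(ch); // not a well-formed escape: copy as-is
                continue;
            }
            int octet = hi * 16 + lo;
            if (isUnreserved(octet)) {
                out.append((char) octet);
            } else {
                out.append('%').append(raw.substring(i + 1, i + 3).toUpperCase(Locale.ROOT));
            }
            i += 2; // skip the two hex digits just consumed
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(normaliseEscapes("/A_%28Secret%29%7e%2fx.xml"));
        // prints "/A_%28Secret%29~%2Fx.xml"
    }
}
```

Note that %28 and %29 are left alone: parentheses are in the reserved set (sub-delims) in RFC3986, so decoding them is not guaranteed to preserve equivalence.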
Try using the toASCIIString method instead of toString for printing the URI. E.g.:
System.out.println(norm1.toASCIIString());
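It is worth checking what toASCIIString actually changes before relying on it. Going by the javadoc quoted in the question, it only encodes characters in the "other" (non-ASCII) category, so as far as I can tell it will not re-escape parentheses, which java.net.URI treats as legal. A small sketch to verify this (the class name, the build helper and the accented path are just illustrative):

```java
import java.net.URI;
import java.net.URISyntaxException;

class AsciiStringDemo {
    // Builds a URI with the multi-argument constructor from a decoded path
    static URI build(String path) throws URISyntaxException {
        return new URI("http", null, "blah.org", -1, path, null, null);
    }

    public static void main(String[] args) throws URISyntaxException {
        URI parens = build("/A_(Secret).xml");
        System.out.println(parens.toString());
        // prints "http://blah.org/A_(Secret).xml"
        System.out.println(parens.toASCIIString());
        // prints "http://blah.org/A_(Secret).xml" (parentheses are left unescaped)

        URI accented = build("/caf\u00e9.xml");
        System.out.println(accented.toASCIIString());
        // prints "http://blah.org/caf%C3%A9.xml" (only the non-ASCII character is encoded)
    }
}
```

So for the URI in the question, toASCIIString and toString give the same result, and the original %28/%29 escapes are not restored.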