I'm having trouble encoding a URL to a URI:
mUrl = "A string url that needs to be encoded for use in a new HttpGet()";
URL url = new URL(mUrl);
URI uri = new URI(url.getProtocol(), url.getAuthority(), url.getPath(),
url.getQuery(), null);
This does not do what I expect. Passing in the string:
http://m.bloomingdales.com/img?url=http%3A%2F%2Fimages.bloomingdales.com%2Fis%2Fimage%2FBLM%2Fproducts%2F3%2Foptimized%2F1140443_fpx.tif%3Fwid%3D52%26qlt%3D90%2C0%26layer%3Dcomp%26op_sharpen%3D0%26resMode%3Dsharp2%26op_usm%3D0.7%2C1.0%2C0.5%2C0%26fmt%3Djpeg&ttl=30d
Comes out as:
http://m.bloomingdales.com/img?url=http%253A%252F%252Fimages.bloomingdales.com%252Fis%252Fimage%252FBLM%252Fproducts%252F3%252Foptimized%252F1140443_fpx.tif%253Fwid%253D52%2526qlt%253D90%252C0%2526layer%253Dcomp%2526op_sharpen%253D0%2526resMode%253Dsharp2%2526op_usm%253D0.7%252C1.0%252C0.5%252C0%2526fmt%253Djpeg&ttl=30d
Which is broken. For example, the %3D is turned into %253D. It seems to be doing something mysterious to the %'s already in the string.
What's going on and what am I doing wrong here?
Percent-encoding is a mechanism to encode 8-bit characters that have specific meaning in the context of URLs. It is sometimes called URL encoding. The encoding consists of a substitution: a '%' followed by the hexadecimal representation of the ASCII value of the replaced character.
A space is assigned number 32, which is 20 in hexadecimal. When you see “%20,” it represents a space in an encoded URL, for example, http://www.example.com/products%20and%20services.html.
The encoding notation replaces the desired character with three characters: a percent sign and two hexadecimal digits that correspond to the position of the character in the ASCII character set.
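To make the substitution concrete, here is a minimal sketch of a percent-encoder. The class and method names are made up for illustration, and it assumes UTF-8 bytes and the "unreserved" character set from RFC 3986; it is not how any particular library does it.

```java
import java.nio.charset.StandardCharsets;

class PercentEncodingSketch {
    // Characters that RFC 3986 calls "unreserved"; they are copied through as-is.
    private static final String UNRESERVED =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~";

    // Replaces every other byte of the UTF-8 form of `text` with '%' plus two hex digits.
    static String percentEncode(String text) {
        StringBuilder out = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            char c = (char) (b & 0xFF);
            if (UNRESERVED.indexOf(c) >= 0) {
                out.append(c);
            } else {
                out.append(String.format("%%%02X", b & 0xFF));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(percentEncode("products and services")); // products%20and%20services
        System.out.println(percentEncode("="));                     // %3D
        System.out.println(percentEncode("%3D"));                   // %253D
    }
}
```

The last call shows why encoding an already-encoded string is harmful: the '%' itself is escaped to %25.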
You are first putting the (already-escaped) string into the URL class. That doesn't escape anything. Then you are pulling out sections of the URL, which returns them without any further processing (so they are still escaped, since they were escaped when you put them in). Finally, you are putting the sections into the URI class, using the multi-argument constructor. This constructor is specified as percent-encoding the URI components.
Therefore, it is in this final step that, for example, a space becomes "%20" (good) and "%3A" becomes "%253A" (bad), because the constructor always escapes a literal "%". Since you are putting in URLs which are already encoded*, you don't want to encode them again.
Therefore, the single-argument constructor of URI is your friend. It doesn't escape anything, and requires that you pass a pre-escaped string. Hence, you don't need URL at all:
mUrl = "A string url is already percent-encoded for use in a new HttpGet()";
URI uri = new URI(mUrl);
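The difference between the two constructors can be seen side by side. This is a small sketch; the class name, method names, and the example.com URL are placeholders, not part of the original code.

```java
import java.net.URI;
import java.net.URISyntaxException;

class DoubleEncodingSketch {
    // Mirrors the question: already-escaped parts go through the
    // multi-argument constructor, which escapes each '%' again.
    static String viaMultiArg(String scheme, String authority, String path, String query) {
        try {
            return new URI(scheme, authority, path, query, null).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // The single-argument constructor only parses; it does not escape anything.
    static String viaSingleArg(String encoded) {
        try {
            return new URI(encoded).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String query = "url=http%3A%2F%2Fexample.org&ttl=30d";
        System.out.println(viaMultiArg("http", "example.com", "/img", query));
        // http://example.com/img?url=http%253A%252F%252Fexample.org&ttl=30d
        System.out.println(viaSingleArg("http://example.com/img?" + query));
        // http://example.com/img?url=http%3A%2F%2Fexample.org&ttl=30d
    }
}
```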
*The only problem is if your URLs are sometimes not percent-encoded, and sometimes they are. Then you have a bigger problem. You need to decide whether your program is starting out with a URL which is always encoded, or one which needs to be encoded.
Note that there is no such thing as a full URL which is not percent-encoded. For example, you can't take the full URL "http://example.com/bob&co" and somehow turn it into the properly-encoded URL "http://example.com/bob%26co" -- how can you tell the difference between the syntax (which shouldn't be escaped) and the characters (which should)? This is why the single-argument form of URI requires that strings are already-escaped. If you have unescaped strings, you need to percent-encode them before inserting them into the full URL syntax, and that is what the multi-argument constructor of URI helps you do.
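As a rough illustration of the unescaped case, the sketch below hands plain, unescaped components to the multi-argument constructor. The class name, method name, and values are hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;

class BuildFromRawPartsSketch {
    // Components are given unescaped; the constructor escapes characters
    // that are illegal in each component (for example, spaces).
    static URI build(String host, String path, String query) {
        try {
            return new URI("http", host, path, query, null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        URI uri = build("example.com", "/hello world", "name=bob smith");
        System.out.println(uri);           // http://example.com/hello%20world?name=bob%20smith
        System.out.println(uri.getPath()); // /hello world  (decoded again on the way out)
    }
}
```

Note that the constructor only escapes characters that are illegal in a component. A character such as "&" is legal in a query, so it is left alone; if it is meant as data inside a parameter value, it still has to be encoded by hand before it goes in.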
Edit: I missed the fact that the original code discards the fragment. If you want to remove the fragment (or any other part) of the URL, you can construct the URI as above, then pull all the parts out as required (they will be decoded into regular strings), then pass them back into the URI multi-argument constructor (where they will be re-encoded as URI components):
uri = new URI(uri.getScheme(), uri.getUserInfo(), uri.getHost(), uri.getPort(),
        uri.getPath(), uri.getQuery(), null); // Remove fragment
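One caveat with the round trip: getQuery() returns decoded text, so an escaped "&" inside a parameter value comes back as a plain "&", and the constructor does not escape it again because "&" is legal in a query. The sketch below compares the round trip with a different technique, cutting the already-encoded string at "#". The class and method names are illustrative only.

```java
import java.net.URI;
import java.net.URISyntaxException;

class StripFragmentSketch {
    // Round trip through decoded components, as in the snippet above.
    static String viaDecodedParts(String encoded) {
        try {
            URI uri = new URI(encoded);
            return new URI(uri.getScheme(), uri.getUserInfo(), uri.getHost(), uri.getPort(),
                    uri.getPath(), uri.getQuery(), null).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Alternative: validate the string, then cut it at the '#' delimiter.
    // In a valid URI an unescaped '#' can only introduce the fragment.
    static String viaSubstring(String encoded) {
        try {
            String s = new URI(encoded).toString();
            int hash = s.indexOf('#');
            return hash < 0 ? s : s.substring(0, hash);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String in = "http://example.com/img?url=a%26b&ttl=30d#top";
        System.out.println(viaDecodedParts(in)); // http://example.com/img?url=a&b&ttl=30d
        System.out.println(viaSubstring(in));    // http://example.com/img?url=a%26b&ttl=30d
    }
}
```

For URLs whose query values contain no escaped delimiters, both approaches should give the same result.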
%3D decodes to = (the equals sign, hex 3D).
%253D decodes to %3D, because %25 is the percent sign itself. In other words, %253D is %3D encoded a second time.
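The two layers can be peeled off one at a time. This is a small sketch using java.net.URLDecoder; the class and method names are placeholders. Keep in mind that URLDecoder follows HTML form rules, so it also turns "+" into a space.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class DecodeLayersSketch {
    // Removes exactly one layer of percent-encoding.
    static String decodeOnce(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        String once = decodeOnce("%253D");
        System.out.println(once);             // %3D
        System.out.println(decodeOnce(once)); // =
    }
}
```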
The URL class didn't decode the %-sequences when it parsed the URL, but the URI class is encoding them (again). Use URI to parse the URL string.
Javadocs:
http://download.oracle.com/javase/6/docs/api/java/net/URL.html
The URL class does not itself encode or decode any URL components according to the escaping mechanism defined in RFC2396. It is the responsibility of the caller to encode any fields, which need to be escaped prior to calling URL, and also to decode any escaped fields, that are returned from URL. Furthermore, because URL has no knowledge of URL escaping, it does not recognise equivalence between the encoded or decoded form of the same URL. For example, the two URLs:
http://foo.com/hello world/ and http://foo.com/hello%20world
would be considered not equal to each other. Note, the URI class does perform escaping of its component fields in certain circumstances.
The recommended way to manage the encoding and decoding of URLs is to use URI, and to convert between these two classes using toURI() and URI.toURL().
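A minimal sketch of that recommendation: parse with URI, then convert to URL only where a URL is needed. The class name, method names, and example URL are assumptions for illustration.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

class UriUrlConversionSketch {
    // Parses an already-encoded string with URI, then converts to URL.
    static URL toUrl(String encoded) {
        try {
            return new URI(encoded).toURL();
        } catch (URISyntaxException | MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Converts back; URL.toURI() re-parses the URL's string form.
    static URI backToUri(URL url) {
        try {
            return url.toURI();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String s = "http://example.com/hello%20world?q=%3D";
        URL url = toUrl(s);
        URI uri = backToUri(url);
        System.out.println(url.getPath()); // /hello%20world  (URL never decodes)
        System.out.println(uri.getPath()); // /hello world    (URI decodes on request)
        System.out.println(url);           // http://example.com/hello%20world?q=%3D
    }
}
```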