
Correctly encoding characters in a URL when using HttpClient

I have a list of URLs that I need to verify are valid. I've written a program in Java that uses Apache's HttpClient to check each link. I had to implement my own redirect strategy because the redirect URLs contain invalid characters (like {}) that the default strategy doesn't handle. It works fine in the majority of cases, except for two:

  1. Escaped Characters in the path or query params, which should not be encoded further. Example:

    String url = "http://www.example.com/chapter1/%3Fref%3Dsomething%26term%3D{keyword}?ref=xyz";
    

    If I use a URI object, it chokes on the "{" character.

    URI myUri = new URI(url); // fails with URISyntaxException
    

    If I run:

    URI myUri = new URI(UriUtils.encodeHttpUrl(url));
    

    it encodes the %3F to %253F. However, when I follow the link using Chrome or Fiddler, I do not see %3F getting escaped again. How do I protect from over-encoding the path or query params?

  2. The last query param in the URL is itself a valid URL. For example:

    String url = "www.example.com/Chapter1/?param1=xyz&param2=http://www.google.com/?abc=1";
    

My current encoding strategy splits up the query params and then calls URLEncoder.encode on each of them. This, however, causes the last param to be encoded as well (which is not the case when I follow it in Fiddler or Chrome).

I've tried a number of things (using UriUtils, special-casing a URL as the last param, and other hacks), but nothing seems ideal. What's the best way to solve this?

smm100 asked Jun 23 '11 02:06


2 Answers

How do I protect from over-encoding the path or query params?

You cannot "protect from over-encoding". You either encode, or you do not. You should always know, for any given string, whether it is encoded or not. You should only encode strings which are not yet encoded, and you should never encode strings which are already encoded.

So is this string encoded or not?

%3Fref%3Dsomething%26term%3D{keyword}

This looks like bad input to me. It is clearly not fully encoded, because it contains invalid characters ('{' and '}'). Yet it is not an unencoded string either, because it contains '%xx' sequences. So it is partly encoded. There is no programmatic "solution" once a string is in this form; you simply need to avoid getting a string into such a form in the first place.

You may be able to construct an algorithm that "fixes" this string by looking for a "%" followed by two hex digits and leaving those sequences alone. But this will break on subtle cases. Consider the unencoded string "42%23", meant as the literal text of the mathematical expression "42 mod 23". When I put this into a URI, I expect it to be encoded as "42%2523" so that it decodes back to "42%23". The algorithm above would instead leave it as "42%23", which decodes to "42#". So there is no reliable way to fix the above string. Encoding "%3F" to "%253F" is exactly what a URI encoder should be doing.
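
To make the ambiguity concrete, here is a minimal sketch of both behaviours for a single path segment or query value. The class and method names (LenientEncodeDemo, strictEncode, lenientEncode) are made up for illustration and are not part of HttpClient or Spring. The sketch assumes UTF-8 and escapes everything outside the RFC 3986 unreserved set.

```java
import java.nio.charset.StandardCharsets;

class LenientEncodeDemo {
    // RFC 3986 "unreserved" characters, which never need escaping.
    private static final String UNRESERVED =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~";

    private static boolean isHex(int b) {
        return (b >= '0' && b <= '9') || (b >= 'a' && b <= 'f') || (b >= 'A' && b <= 'F');
    }

    /** Encodes one component, treating every '%' as literal data. */
    static String strictEncode(String s) {
        return encode(s, false);
    }

    /** Same, but leaves anything that looks like %XX untouched (the fragile heuristic). */
    static String lenientEncode(String s) {
        return encode(s, true);
    }

    private static String encode(String s, boolean keepEscapes) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            int b = bytes[i] & 0xFF;
            // A '%' followed by two hex digits is assumed to be an existing escape.
            boolean looksEscaped = keepEscapes && b == '%' && i + 2 < bytes.length
                    && isHex(bytes[i + 1]) && isHex(bytes[i + 2]);
            if (UNRESERVED.indexOf(b) >= 0 || looksEscaped) {
                out.append((char) b);
            } else {
                out.append(String.format("%%%02X", b));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String partlyEncoded = "%3Fref%3Dsomething%26term%3D{keyword}";
        // Heuristic: only the braces are escaped.
        System.out.println(lenientEncode(partlyEncoded)); // %3Fref%3Dsomething%26term%3D%7Bkeyword%7D
        // Strict: every '%' is escaped as well.
        System.out.println(strictEncode(partlyEncoded));  // %253Fref%253Dsomething%2526term%253D%7Bkeyword%7D
        // The "42 mod 23" case: the heuristic silently changes the meaning.
        System.out.println(lenientEncode("42%23"));       // 42%23 (decodes to "42#")
        System.out.println(strictEncode("42%23"));        // 42%2523 (decodes to "42%23")
    }
}
```

The heuristic gives the result the question asks for on the first input, but it cannot tell that input apart from "42%23", which is why it is only a guess about what the string was meant to be.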

Note: Having said this, browsers often let you get away with typing bad characters into URIs and encode them automatically. That is not very robust, so it shouldn't be relied on unless you are trying to be very forgiving of user input. In that case, you can make a "best effort" by first decoding the URI and then re-encoding it. With that approach, if I wanted the literal text "42%23" I would have to type "42%2523" myself.
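
A rough sketch of that best-effort approach, applied to one query value rather than a whole URL (an assumption on my part, since URLEncoder would otherwise also escape the ':' and '/' of the URL itself). BestEffortNormalizer and normalizeQueryValue are hypothetical names. Only java.net.URLDecoder and java.net.URLEncoder are used, and both do form encoding, hence the handling of '+'.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

class BestEffortNormalizer {
    /**
     * Best-effort normalization of ONE query value typed by a user:
     * decode whatever escapes are present, then encode the result exactly once.
     * Throws IllegalArgumentException on a malformed escape such as "100%".
     */
    static String normalizeQueryValue(String raw) {
        try {
            // URLDecoder turns '+' into a space, so protect literal plus signs first.
            String decoded = URLDecoder.decode(raw.replace("+", "%2B"), "UTF-8");
            // URLEncoder emits '+' for spaces (form encoding); use %20 instead.
            return URLEncoder.encode(decoded, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        // Existing escapes survive and the braces get escaped.
        System.out.println(normalizeQueryValue("%3Fref%3Dsomething%26term%3D{keyword}"));
        // "42%23" is read as an escaped '#', so it comes back unchanged.
        System.out.println(normalizeQueryValue("42%23"));
        // To mean the literal text "42%23", the user has to type the escaped form.
        System.out.println(normalizeQueryValue("42%2523"));
    }
}
```

The result is stable when the method is applied repeatedly, which is what stops %3F from turning into %253F, at the cost of always reading "%23" as an escaped '#'.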

As for question 2:

This however causes the last param to be encoded as well

Similarly, this is exactly what you want. If a URI appears as a parameter inside another URI, it should be percent-encoded. Otherwise, how can you tell where one URI finishes and the other continues? I believe the above URI is actually valid (since ':', '/', '&' and '=' are reserved characters, not forbidden, and therefore they are allowed as long as they do not create ambiguity). But it is much safer to have a URI-inside-a-URI escaped.
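
Here is a small sketch of why the escaping matters. The nested URL has an extra "&def=2" that is not in the question; I added it to show the ambiguity. NestedUrlParam, encodeValue and queryParam are illustrative helpers, not library calls, and the parsing is deliberately naive.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLDecoder;
import java.net.URLEncoder;

class NestedUrlParam {
    static String encodeValue(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    /** Returns the decoded value of the named parameter, or null if it is absent. */
    static String queryParam(String url, String name) {
        String rawQuery = URI.create(url).getRawQuery();
        if (rawQuery == null) {
            return null;
        }
        for (String pair : rawQuery.split("&")) {
            int eq = pair.indexOf('=');
            String key = eq < 0 ? pair : pair.substring(0, eq);
            if (key.equals(name)) {
                try {
                    return URLDecoder.decode(eq < 0 ? "" : pair.substring(eq + 1), "UTF-8");
                } catch (UnsupportedEncodingException e) {
                    throw new IllegalStateException("UTF-8 is always supported", e);
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String base = "http://www.example.com/Chapter1/?param1=xyz&param2=";
        String nested = "http://www.google.com/?abc=1&def=2"; // "&def=2" is a made-up addition

        String unescaped = base + nested;
        String escaped = base + encodeValue(nested);

        // Unescaped: the nested URL is cut off at its '&', and "def" leaks into the outer URL.
        System.out.println(queryParam(unescaped, "param2")); // http://www.google.com/?abc=1
        System.out.println(queryParam(unescaped, "def"));    // 2

        // Escaped: the nested URL round-trips intact.
        System.out.println(escaped);
        System.out.println(queryParam(escaped, "param2"));   // http://www.google.com/?abc=1&def=2
    }
}
```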

mgiuca answered Sep 30 '22 10:09


I really don't know, but you could try decoding it first, so that the %3F turns back into what it was, and then encoding it again.

So:

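    // Caution: URLEncoder does form encoding and also escapes ':' and '/',
    // so apply this to a single query value rather than to a whole URL.
    String decoded = URLDecoder.decode(url, "UTF-8");
    url = URLEncoder.encode(decoded, "UTF-8");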
Martijn Courteaux answered Sep 30 '22 11:09