Trouble encoding a u-umlaut within a .NET HTTP handler

I have a JavaScript request going to an ASP.NET (2.0) HTTP handler, which passes the request on to a Java web service. In this system, special characters such as accented letters do not get passed on correctly.

E.G.

  • Human input: Düsseldorf
  • becomes an asynchronous JavaScript request to http://site/serviceproxy.ashx?q=D%FCsseldorf, which, as far as I can tell, is valid in ISO-8859-1 as well as in UTF-8 (unless it should be %c3%bc in UTF-8).
  • HttpContext.Current.Request.QueryString.Get("q") returns D�sseldorf which is where trouble begins.
  • but HttpUtility.UrlEncode(HttpContext.Current.Request.QueryString.Get("q"), Encoding.GetEncoding("ISO-8859-1")) returns D%3fsseldorf (a '?')
  • and HttpUtility.UrlEncode(HttpContext.Current.Request.QueryString.Get("q"), Encoding.UTF8) returns D%ef%bfsseldorf

So the value is neither decoded nor re-encoded correctly before being passed on to the Java service.

  • Notice HttpContext.Current.Request.Url.Query is ?q=D%FCsseldorf&output=json&from=1&to=10
  • while HttpContext.Current.Request.QueryString.ToString() is q=D%ufffdsseldorf&output=json&from=1&to=10

Why is this, and how can I tell the HttpContext to honor the request headers which include:

Content-Type=application/x-www-form-urlencoded;+charset=UTF-8

and decode the URL's QueryString using the UTF-8 charset.

Addendum: As the answer notes, the trouble lies not so much in the decoding as the encoding; using escape() in JavaScript does not escape according to UTF-8, while using encodeURIComponent() does.

dlamblin asked Nov 25 '08 22:11


1 Answer

I don't know what the default character encoding used by your server (IIS?) is, or if it can be changed, but I can tell you a few things that might help.

0xFC is the ISO-8859-1 encoding for ü. The Unicode code point is also U+00FC, but UTF-8 needs two bytes to encode it: 0xC3 0xBC.
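To make the byte difference concrete, here is a minimal sketch in JavaScript. It assumes a runtime that provides the standard TextEncoder (modern browsers, recent Node.js); toHex is a hypothetical helper added only for display.

```javascript
// Compare the bytes for "ü" (U+00FC) under ISO-8859-1 and UTF-8.
const umlaut = '\u00FC'; // ü

// ISO-8859-1 maps U+0000..U+00FF directly onto single bytes,
// so for this character the byte value equals the code point.
const latin1Byte = umlaut.charCodeAt(0);

// TextEncoder always produces UTF-8.
const utf8Bytes = Array.from(new TextEncoder().encode(umlaut));

// Hypothetical helper: format a byte as 0xNN.
const toHex = (b) => '0x' + b.toString(16).toUpperCase().padStart(2, '0');

console.log(toHex(latin1Byte));              // 0xFC
console.log(utf8Bytes.map(toHex).join(' ')); // 0xC3 0xBC
```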

If a UTF-8 decoder were to see the illegal byte 0xFC, it would decode it as the Unicode "replacement character", U+FFFD, and resume at the start of the next valid byte sequence, in this case 's'.
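The same behaviour can be reproduced outside ASP.NET. The sketch below uses the standard TextDecoder as a stand-in for the server's decoder, which is an assumption: it only illustrates how a lenient UTF-8 decoder treats the byte, not what IIS does internally.

```javascript
// "Düsseldorf" as the client sent it: ISO-8859-1 bytes, with ü as 0xFC.
const original = 'D\u00FCsseldorf';
const latin1Bytes = Uint8Array.from(original, (ch) => ch.charCodeAt(0));

// A lenient UTF-8 decoder replaces the invalid byte 0xFC with U+FFFD
// and carries on with the following bytes.
const decoded = new TextDecoder('utf-8').decode(latin1Bytes);
console.log(decoded === 'D\uFFFDsseldorf'); // true

// decodeURIComponent is strict instead: %FC is not valid UTF-8, so it throws.
let threw = false;
try {
  decodeURIComponent('D%FCsseldorf');
} catch (e) {
  threw = e instanceof URIError;
}
console.log(threw); // true
```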

The reason you get %3f is that '?' is the substitution character used when encoding to ISO-8859-1 (Latin-1) a character that the character set cannot represent, much as � (U+FFFD) is the replacement character in Unicode.

I believe what you're seeing is the client encoding with ISO-8859-1, but the server is decoding with UTF-8. As soon as it hits the server, your data is corrupted. I recommend that you modify the client to use UTF-8 encoding; it should be requesting http://site/serviceproxy.ashx?q=D%C3%BCsseldorf

It sounds like you are constructing these URLs from JavaScript, so you should use the encodeURI and encodeURIComponent functions, not escape.
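As a rough illustration of the difference, the sketch below encodes the same value both ways. The variable names are made up for the example, and the URL is the one from the question.

```javascript
const city = 'D\u00FCsseldorf'; // Düsseldorf

// escape() emits %XX from the raw code unit for characters below U+0100,
// which happens to match ISO-8859-1, not UTF-8.
const legacyEscaped = escape(city);
console.log(legacyEscaped); // D%FCsseldorf

// encodeURIComponent() percent-encodes the UTF-8 bytes of the string.
const utf8Escaped = encodeURIComponent(city);
console.log(utf8Escaped); // D%C3%BCsseldorf

// Build the query string with the UTF-8 form.
const requestUrl = 'http://site/serviceproxy.ashx?q=' + utf8Escaped;
console.log(requestUrl); // http://site/serviceproxy.ashx?q=D%C3%BCsseldorf
```

encodeURIComponent is the better fit for a single parameter value, since encodeURI leaves characters such as & and = unescaped.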

erickson answered Oct 05 '22 10:10