
URL Escaping Chinese/Japanese Unicode Characters for Internet Explorer

I'm trying to URL-escape (percent-encode) non-ASCII characters in several URLs I'm dealing with. I'm working with a Flash application that loads resources such as images and sound clips from these URLs. The filenames can contain non-ASCII characters, for example 日本語.jpg, so I escape them by encoding the characters as UTF-8 and then percent-escaping the resulting bytes, which gives the following:

%E6%97%A5%E6%9C%AC%E8%AA%9E.jpg
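For readers who want to reproduce the escaping step outside Flash, here is a minimal sketch in Python (not the ActionScript used in the app). The helper name `percent_encode_filename` is made up for illustration; `urllib.parse.quote` is the standard-library call doing the work.

```python
from urllib.parse import quote


def percent_encode_filename(name: str) -> str:
    """Encode the name as UTF-8 and percent-escape every byte that is
    not an unreserved ASCII character (letters, digits, '_', '.', '-', '~')."""
    # quote() uses UTF-8 by default; safe="" means no extra characters are exempt.
    return quote(name, safe="")


print(percent_encode_filename("日本語.jpg"))
# prints: %E6%97%A5%E6%9C%AC%E8%AA%9E.jpg
```

Each of the three characters becomes three UTF-8 bytes, which is why the escaped name contains nine `%XX` groups.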

These filenames work fine when I run the app in any browser other than Internet Explorer - I've tried Firefox, Safari and Chrome. But when I launch the app in IE (tried both 6 and 8) and it tries to load the sound clip, I get: Error #2044: Unhandled ioError, and the URL has been corrupted to something like:

æ¥æ¬èª.jpg
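The corrupted name above looks like what you get when the escaped bytes are decoded one byte per character instead of as UTF-8. The Python sketch below illustrates that hypothesis; it assumes ISO-8859-1 (Latin-1) as the single-byte encoding, which is a guess based on the symptom and not something confirmed about IE or the Flash plugin.

```python
from urllib.parse import unquote

escaped = "%E6%97%A5%E6%9C%AC%E8%AA%9E.jpg"

# Intended interpretation: the escaped bytes are UTF-8.
as_utf8 = unquote(escaped, encoding="utf-8")

# Assumed faulty interpretation: each escaped byte is one Latin-1 character.
as_latin1 = unquote(escaped, encoding="latin-1")

# Bytes 0x97, 0x9C and 0x9E map to invisible C1 control characters in
# Latin-1, so drop them to see what would typically show up on screen.
visible = "".join(ch for ch in as_latin1 if not 0x80 <= ord(ch) <= 0x9F)

print(as_utf8)
print(visible)
```

If the decoder used Windows-1252 instead, the three control bytes would appear as visible punctuation and letters, so the exact garbage depends on the code page in play.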

Any thoughts on how to fix this? So far I am only test-driving the Flash app with local filesystem URLs. I've also noticed that Internet Explorer isn't able to locate a file such as file:///C:/%E6%97%A5%E6%9C%AC%E8%AA%9E.jpg, though Chrome and Firefox will decode it and load it just fine for a file with the path

C:\日本語.jpg

edit

I think my problem is the same as the one encountered in the following ActionScript code fragment:

import flash.display.Loader;
import flash.net.URLRequest;
...
var ldr:Loader;
var req:URLRequest = new URLRequest("日本語.jpg");
ldr = new Loader();
ldr.load(req);

Using the string 日本語.jpg works in IE, while using the string %E6%97%A5%E6%9C%AC%E8%AA%9E.jpg works in other browsers. What I need is a single form that works in all browsers. I have tried the %u encoding, and I have tried setting the HTTP request header to Content-Type: text/html; charset=utf-8, with no luck in either percent-escaped or unescaped form.

Bear Avatar asked Nov 25 '09 04:11



3 Answers

IE uses UTF-8 for HTTP URLs, but I'm not sure about file URLs (even though I tested the behavior as part of the IE team about 10 years ago). If you are using the URLs in HTML, I'd actually recommend trying string literals (if your page encoding is UTF-8) or numeric character references (&#dddd;). IE will generally convert the characters into an appropriate encoding, which would be UTF-8 for the HTTP stuff and UTF-16 for local file system interactions.

It's actually HTTP that needs the URL-escaping, not the HTML parser.
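As an illustration of the numeric character reference form mentioned in this answer, here is a small Python sketch that rewrites non-ASCII characters as decimal references for use in HTML markup. The function name `to_numeric_character_references` is hypothetical, and this only covers the HTML case; it does not address strings passed directly to URLRequest in ActionScript.

```python
def to_numeric_character_references(text: str) -> str:
    """Replace every non-ASCII character with a decimal HTML reference
    of the form &#dddd; and leave ASCII characters untouched."""
    return "".join(ch if ord(ch) < 128 else f"&#{ord(ch)};" for ch in text)


print(to_numeric_character_references("日本語.jpg"))
# prints: &#26085;&#26412;&#35486;.jpg
```

The HTML parser turns the references back into characters, and the browser then picks the encoding when it builds the actual request.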

JasonTrue Avatar answered Oct 24 '22 01:10



Sorry, no solution, but maybe at least some more information about what might be going on here. (You have probably figured this much out already, but maybe it will help another reader find a solution.) The "official" URL encoding specification seems to leave the door wide open as to how to decode escaped URLs like the ones you are generating: are the escaped entities intended to represent UTF-8 characters (as Firefox, etc. are interpreting them) or ASCII characters (as IE is interpreting them)? I don't know of any way to force the intended decoding strategy.

Just a question: what bad thing happens if you do not escape them at all, but leave the Unicode in the URL? Although I don't have a lot of experience with it, I seem to remember reading somewhere that the days of needing to escape Unicode in URLs are behind us. I could be wrong about that...

Dave Mateer Avatar answered Oct 23 '22 23:10



Try encoding only the parts of the URI that would cause it to be parsed incorrectly. For instance, encode &, ?, and space. Leave everything else as is, and it should work like a charm.
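A rough Python sketch of that selective approach: escape only the characters named above and pass everything else, including the Japanese characters, through unchanged. The function name and the exact set of escaped characters are assumptions for illustration; a real URL may need more characters escaped (for example % or #).

```python
# Characters named in the answer; extend this set as needed (assumption).
CHARS_TO_ESCAPE = {"&", "?", " "}


def escape_reserved_only(name: str) -> str:
    """Percent-escape only the listed ASCII characters and leave all
    other characters, including non-ASCII ones, exactly as they are."""
    return "".join(
        f"%{ord(ch):02X}" if ch in CHARS_TO_ESCAPE else ch for ch in name
    )


print(escape_reserved_only("日本語 a&b?.jpg"))
# prints: 日本語%20a%26b%3F.jpg
```

The result is no longer a pure-ASCII URL, so whether it loads still depends on how each browser or plugin encodes the remaining characters.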

If you are still running into problems, you may need to set the content type to UTF-8 in your HTTP headers. Something like Content-Type: text/html; charset=UTF-8.

Bear Avatar answered Oct 23 '22 23:10
