
Handle non-ASCII filenames in XHR uploading

I have fairly standard JavaScript/XHR drag-and-drop file upload code, and just came across an unfortunate real-world snag. I have a file on my (Win7) desktop called "TEST-é-TEST.txt". In Chrome (30.0.1599.69), it arrives at the server with the filename in UTF-8, which works out fine. In Firefox (24.0), the filename is mangled by the time it arrives at the server.

I didn't trust what Firebug/Chrome might be telling me about the encoding, so I examined the hex of the request packet. Everything else is the same except the non-ASCII character is indeed being encoded differently in the two browsers:

Chrome: C3 A9 (this is the expected UTF-8 for that character)
Firefox: EF BF BD (UTF-8 "replacement character"?!)

Is this a Firefox bug? I tried renaming the file, replacing the é with ó, and the Firefox hex was the same... so such a mangle really seems like a browser bug. (If Firefox were confusedly sending along ISO-8859-1, for example, without touching it, I'd see an E9 byte, and I could handle that on the server side, but it shouldn't mangle it!)
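For reference, the three byte sequences discussed above can be reproduced with a short sketch. It assumes an environment with a global TextEncoder (modern browsers, or Node.js); the helper name toHex is made up for this illustration.

```javascript
// Hypothetical helper: format bytes as space-separated uppercase hex pairs.
function toHex(bytes) {
  return Array.from(bytes, (b) => b.toString(16).toUpperCase().padStart(2, "0")).join(" ");
}

const utf8 = new TextEncoder(); // TextEncoder always encodes to UTF-8

// é (U+00E9) encoded as UTF-8: what Chrome sent.
const eAcuteUtf8 = toHex(utf8.encode("\u00E9"));

// U+FFFD REPLACEMENT CHARACTER encoded as UTF-8: what showed up from Firefox.
const replacementUtf8 = toHex(utf8.encode("\uFFFD"));

// é in ISO-8859-1 is a single byte equal to its code point.
const eAcuteLatin1 = toHex([0xe9]);

console.log(eAcuteUtf8);      // C3 A9
console.log(replacementUtf8); // EF BF BD
console.log(eAcuteLatin1);    // E9
```

Because U+FFFD stands in for any undecodable input, é and ó both end up as the same EF BF BD sequence, which matches the observation that renaming the file did not change the hex.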

Regardless of the reason, is there something I can do on either the client or server sides to correct for this? If a replacement character is indeed being sent to the server, then it would seem unrecoverable there, so I almost certainly need to do it on the client side.

And yes, the page on which this code exists has charset=utf-8, and Firefox confirms that it perceives the page as UTF-8 under View>Character Encoding.

Furthermore, if I dump the filename to console.log, it appears fine there--I guess it's just getting mangled in/after setRequestHeader("X-File-Name",file.name).

Finally, it would seem that the value passed to setRequestHeader() should be able to have code points up to U+00FF, so U+00E9 (é) and U+00F3 (ó) shouldn't cause a problem, though higher codes could trigger a SyntaxError: http://www.w3.org/TR/XMLHttpRequest2/#the-setrequestheader-method
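A minimal sketch of that check, assuming the U+00FF limit described above is the only constraint of interest (the function name isByteString is made up here and is not a browser API):

```javascript
// Hypothetical helper: true when every UTF-16 code unit in the string
// is at most 0xFF, i.e. the value fits in one byte per character.
function isByteString(value) {
  for (let i = 0; i < value.length; i++) {
    if (value.charCodeAt(i) > 0xff) {
      return false;
    }
  }
  return true;
}

console.log(isByteString("TEST-\u00E9-TEST.txt")); // true  (é is U+00E9)
console.log(isByteString("TEST-\u00F3-TEST.txt")); // true  (ó is U+00F3)
console.log(isByteString("\u6587\u4EF6.txt"));     // false (文件 is above U+00FF)
```

A check like this could be run on file.name before calling setRequestHeader() to decide whether the raw value is safe to pass at all.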

asked Oct 08 '13 16:10 by dlo


1 Answer

Thanks so much for Boris's help. Here's a summary of what I discovered through our interactions in comments:

1) The core issue is that HTTP request headers are supposed to be ISO-8859-1. Earlier versions of both Chrome and Firefox passed UTF-8 strings along unchanged in setRequestHeader() calls. This changed in Firefox 24.0 (and apparently will be changing in Chrome soon too): Firefox now drops the high byte of each character and passes along only the low byte. For the example in the question this is recoverable, since é fits in a single byte, but characters with higher code points are mangled irretrievably.

2) One workaround would be to encode on the client side, e.g.:

xhr.setRequestHeader('X-File-Name', encodeURIComponent(filename)); // xhr: the XMLHttpRequest doing the upload

and then decode on the server side, e.g. in PHP:

$filename = rawurldecode($_SERVER['HTTP_X_FILE_NAME']);

3) Note that this is only a problem because my Ajax file upload approach sends the raw file data in the request body, so the filename has to travel in a custom request header (as shown in many tutorials online). If I used FormData instead, I wouldn't have to worry about this. I believe that if you want solid, standards-based Unicode filename support, you should use FormData and not the request header approach.
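The three points above can be sketched together in JavaScript. This is an illustration under assumptions, not the browsers' actual code: truncateToLowBytes only models the "keep the low byte" behavior as described in point 1, and every function name, the "file" field name, and the url argument are made up for the example. The FormData part assumes a browser (or a runtime that provides FormData and Blob).

```javascript
// Point 1: model "keep only the low byte of each UTF-16 code unit".
function truncateToLowBytes(value) {
  const bytes = [];
  for (let i = 0; i < value.length; i++) {
    bytes.push(value.charCodeAt(i) & 0xff);
  }
  return bytes;
}

// Read bytes back as ISO-8859-1, where byte N maps to code point U+00NN.
function decodeLatin1(bytes) {
  return String.fromCharCode(...bytes);
}

// Point 2: percent-encode so the header value is plain ASCII.
function encodeFilenameHeader(name) {
  return encodeURIComponent(name);
}

// JavaScript counterpart of PHP's rawurldecode() on the receiving side.
function decodeFilenameHeader(value) {
  return decodeURIComponent(value);
}

// Point 3: put the file in a multipart form, so the filename travels
// in the request body instead of a request header.
function buildUploadForm(file, filename) {
  const form = new FormData();
  form.append("file", file, filename); // "file" is a hypothetical field name
  return form;
}

// Browser-only: post the form to a caller-supplied (hypothetical) URL.
function sendUpload(form, url) {
  const xhr = new XMLHttpRequest();
  xhr.open("POST", url);
  xhr.send(form); // the browser sets the multipart Content-Type itself
  return xhr;
}
```

With these helpers, decodeLatin1(truncateToLowBytes("é")) gives back "é", while "ő" (U+0151) comes back as "Q" because only the low byte 0x51 survives. encodeFilenameHeader("TEST-é-TEST.txt") yields "TEST-%C3%A9-TEST.txt", which contains nothing for a header layer to mangle.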

answered Oct 01 '22 11:10 by dlo