I have a web application that allows users to upload their content for processing. The processing engine expects UTF8 (and I'm composing XML from multiple users' files), so I need to ensure that I can properly decode the uploaded files.
Since I'd be surprised if any of my users knew their files even were encoded, I have very little hope they'd be able to correctly specify the encoding (decoder) to use. And so, my application is left with task of detecting before decoding.
This seems like such a universal problem, I'm surprised not to find either a framework capability or general recipe for the solution. Can it be I'm not searching with meaningful search terms?
I've implemented BOM-aware detection (http://en.wikipedia.org/wiki/Byte_order_mark) but I'm not sure how often files will be uploaded w/o a BOM to indicate encoding, and this isn't useful for most non-UTF files.
My questions boil down to:
So far I've found:
Thanks.
There won't be an absolutely reliable way, but you may be able to get "pretty good" result with some heuristics.
Whether "pretty good" is "good enough" depends on your application, of course. If you need to be sure, you might want to display the results as a preview, and let the user confirm that the data looks right. If it doesn't, try the next likely encoding, until the user is satisfied.
Note: this algorithm will not work if the data contains garbage characters. For example, a single garbage byte in otherwise valid utf-8 will cause utf-8 decoding to fail, making the algorithm go down the wrong path. You may need to take additional measures to handle this. For example, if you can identify possible garbage beforehand, strip it before you try to determine the encoding. (It doesn't matter if you strip too aggressive, once you have determined the encoding, you can decode the original unstripped data, just configure the decoders to replace invalid characters instead of throwing an exception.) Or count decoding errors and weight them appropriately. But this probably depends much on the nature of your garbage, i.e. what assumptions you can make.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With