
Is it possible to reliably auto-decode user files to Unicode? [C#]

I have a web application that allows users to upload their content for processing. The processing engine expects UTF-8 (and I'm composing XML from multiple users' files), so I need to ensure that I can properly decode the uploaded files.

Since I'd be surprised if any of my users even knew their files were encoded, I have very little hope they'd be able to correctly specify the encoding (decoder) to use. And so, my application is left with the task of detecting the encoding before decoding.

This seems like such a universal problem, I'm surprised not to find either a framework capability or general recipe for the solution. Can it be I'm not searching with meaningful search terms?

I've implemented BOM-aware detection (http://en.wikipedia.org/wiki/Byte_order_mark), but I'm not sure how often files will be uploaded without a BOM to indicate the encoding, and a BOM check isn't useful for most non-UTF files.

My questions boil down to:

  1. Is BOM-aware detection sufficient for the vast majority of files?
  2. In the case where BOM-detection fails, is it possible to try different decoders and determine if they are "valid"? (My attempts indicate the answer is "no.")
  3. Under what circumstances will a "valid" file fail with the C# encoder/decoder framework?
  4. Is there a repository anywhere that has a multitude of files with various encodings to use for testing?
  5. While I'm specifically asking about C#/.NET, I'd like to know the answer for Java, Python and other languages for the next time I have to do this.

So far I've found:

  • Re-encoding a "valid" UTF-16 file containing Ctrl-S characters as UTF-8 threw an exception (illegal character?). (That was an XML encoding exception.)
  • Decoding a valid UTF-16 file with UTF-8 succeeds but gives text with null characters. Huh?
  • Currently, I only expect UTF-8, UTF-16 and probably ISO-8859-1 files, but I want the solution to be extensible if possible.
  • My existing set of input files isn't nearly broad enough to uncover all the problems that will occur with live files.
  • Although the files I'm trying to decode are "text", I think they are often created with methods that leave garbage characters in the files. Hence "valid" files may not be "pure". Oh joy.
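
Two of the findings above can be reproduced in a few lines. This is a minimal sketch in Python (one of the languages listed in question 5), not the C# code that produced the original observations, and the sample strings are made up for illustration. It shows that a BOM-less UTF-16 file holding only ASCII-range text is also valid UTF-8, because each 0x00 byte decodes to U+0000, and that ISO-8859-1 accepts every byte sequence, so "decodes without error" cannot identify the encoding by itself.

```python
# ASCII-range text stored as UTF-16 little-endian, without a BOM.
utf16_bytes = "Hi!".encode("utf-16-le")       # b'H\x00i\x00!\x00'

# Strict UTF-8 decoding succeeds: a 0x00 byte is valid UTF-8 (U+0000),
# which is where the null characters in the decoded text come from.
as_utf8 = utf16_bytes.decode("utf-8")

# ISO-8859-1 maps every byte value to a character, so it never fails.
as_latin1 = bytes(range(256)).decode("iso-8859-1")

# UTF-16 data with non-ASCII characters can fail strict UTF-8 decoding.
try:
    "é".encode("utf-16-le").decode("utf-8")   # bytes are b'\xe9\x00'
    failed = False
except UnicodeDecodeError:
    failed = True
```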

Thanks.

NVRAM asked Feb 22 '10 20:02


1 Answer

There won't be an absolutely reliable way, but you may be able to get "pretty good" results with some heuristics.

  • If the data starts with a BOM, use it.
  • If the data contains 0-bytes, it is likely utf-16 or utf-32. You can distinguish between these, and between the big-endian and little-endian variants of these, by looking at the positions of the 0-bytes.
  • If the data can be decoded as utf-8 (without errors), then it is very likely utf-8 (or US-ASCII, but this is a subset of utf-8)
  • Next, if you want to go international, map the browser's language setting to the most likely encoding for that language.
  • Finally, assume ISO-8859-1
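
The steps above can be sketched as follows. This is a sketch in Python (one of the languages the question asks about) under stated assumptions, not a definitive implementation: the function names, the 4096-byte sample, the 0.8 threshold and the `LANGUAGE_DEFAULTS` table are hypothetical choices made for illustration, and the 0-byte test assumes mostly Latin-script text.

```python
import codecs

# Longest BOMs first: the UTF-32 LE BOM starts with the UTF-16 LE BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Illustrative language-to-legacy-encoding guesses (assumption, not authoritative).
LANGUAGE_DEFAULTS = {"ru": "windows-1251", "ja": "shift_jis"}


def guess_from_zero_bytes(data, sample_size=4096, threshold=0.8):
    """Guess a UTF-16/UTF-32 variant from the positions of the 0-bytes."""
    sample = data[:sample_size]
    if b"\x00" not in sample:
        return None
    # Share of 0-bytes at each byte offset modulo 4.
    mostly_zero = []
    for offset in range(4):
        column = sample[offset::4]
        share = column.count(0) / len(column) if column else 0.0
        mostly_zero.append(share > threshold)
    patterns = {
        (False, True, True, True): "utf-32-le",
        (True, True, True, False): "utf-32-be",
        (False, True, False, True): "utf-16-le",
        (True, False, True, False): "utf-16-be",
    }
    return patterns.get(tuple(mostly_zero))


def guess_encoding(data, language=None):
    """Return (encoding_name, bom_length), trying the heuristics in order."""
    for bom, name in BOMS:                      # 1. BOM
        if data.startswith(bom):
            return name, len(bom)
    by_zeros = guess_from_zero_bytes(data)      # 2. 0-byte positions
    if by_zeros:
        return by_zeros, 0
    try:                                        # 3. strict UTF-8
        data.decode("utf-8")
        return "utf-8", 0
    except UnicodeDecodeError:
        pass
    # 4. language hint, else 5. ISO-8859-1
    return LANGUAGE_DEFAULTS.get(language, "iso-8859-1"), 0
```

A caller would then decode with `data[bom_length:].decode(encoding_name)`. The order matters: the 0-byte check comes before the UTF-8 check because 0x00 is itself valid UTF-8, and ISO-8859-1 comes last because it accepts any byte sequence. In .NET, the strict UTF-8 step would presumably rely on a decoder configured to throw on invalid bytes, for example `new UTF8Encoding(false, true)`.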

Whether "pretty good" is "good enough" depends on your application, of course. If you need to be sure, you might want to display the results as a preview, and let the user confirm that the data looks right. If it doesn't, try the next likely encoding, until the user is satisfied.
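
The preview idea could look roughly like this. It is a sketch in Python; `previews`, the candidate list and the 200-character preview length are hypothetical, and presenting the previews to the user is left to the application.

```python
def previews(data, candidates, length=200):
    """Yield (encoding, preview_text) for each candidate that decodes cleanly."""
    for name in candidates:
        try:
            text = data.decode(name)
        except (UnicodeDecodeError, LookupError):
            continue  # undecodable with this candidate, or unknown codec name
        yield name, text[:length]


# Example: offer the candidates in order of likelihood until one looks right.
uploaded = "café".encode("iso-8859-1")
options = list(previews(uploaded, ["utf-8", "iso-8859-1"]))
```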

Note: this algorithm will not work if the data contains garbage characters. For example, a single garbage byte in otherwise valid utf-8 will cause utf-8 decoding to fail, making the algorithm go down the wrong path. You may need to take additional measures to handle this. For example, if you can identify possible garbage beforehand, strip it before you try to determine the encoding. (It doesn't matter if you strip too aggressively: once you have determined the encoding, you can decode the original unstripped data; just configure the decoders to replace invalid characters instead of throwing an exception.) Or count decoding errors and weight them appropriately. But this probably depends heavily on the nature of your garbage, i.e. on what assumptions you can make.
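
Counting errors instead of failing on the first one might be sketched like this in Python. The helper names and the 1% tolerance are assumptions for illustration, and the count is only approximate, since a file that genuinely contains U+FFFD would be counted too. The check is meaningless for ISO-8859-1, which never produces errors.

```python
def decode_with_error_rate(data, encoding):
    """Decode leniently; return (text, share of characters that were replaced)."""
    # errors="replace" substitutes U+FFFD for undecodable bytes instead of raising.
    text = data.decode(encoding, errors="replace")
    errors = text.count("\ufffd")
    return text, errors / max(len(text), 1)


def looks_like(data, encoding, max_error_rate=0.01):
    """True if only a small share of the data fails to decode (hypothetical threshold)."""
    return decode_with_error_rate(data, encoding)[1] <= max_error_rate


clean = ("héllo wörld " * 50).encode("utf-8")
dirty = clean + b"\xff" + clean            # one garbage byte in valid UTF-8
latin = ("café résumé " * 50).encode("iso-8859-1")
```

Here `looks_like(dirty, "utf-8")` still accepts the file despite the stray byte, while `looks_like(latin, "utf-8")` rejects the ISO-8859-1 data. .NET offers a comparable replace-instead-of-throw mode through `DecoderFallback.ReplacementFallback`.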

oefe answered Nov 04 '22 23:11