Suppose I have a byte stream (array), and I want to write code (using .NET C#) to check whether it is a valid UTF-8 byte sequence. I want to write the code from scratch because I need to report the exact location of each invalid byte sequence, and possibly remove the invalid bytes; a plain yes/no answer about whether the stream/array is valid is not enough.
Is there any sample code I can refer to? If there is no C# code, simple samples in C++ or Java are also appreciated. Thanks!
By invalid UTF-8 byte sequences, I mean the ones described here:
http://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences
Thanks in advance, George
What you need is DecoderFallback. When the Encoding class decodes a sequence of bytes, you can specify what it should do when it meets bytes that are not valid in the source encoding.
Using UTF8Encoding together with a DecoderReplacementFallback (which substitutes a replacement string for each invalid sequence), you can achieve just what you're looking for.
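To illustrate the replacement-fallback idea for the "remove invalid bytes" part of the question, here is a minimal sketch, not a definitive implementation. The class name Utf8Cleaner and the method RemoveInvalidBytes are made up for this example. It assumes that an empty replacement string is acceptable, so every invalid sequence is simply dropped, and that a decode/re-encode round trip is affordable for your data size.

```csharp
using System.Text;

// Hypothetical helper (name invented for this sketch).
static class Utf8Cleaner
{
    // Decodes with an empty replacement string, so each invalid sequence
    // contributes nothing to the decoded text, then re-encodes as UTF-8.
    public static byte[] RemoveInvalidBytes(byte[] data)
    {
        Encoding lenient = Encoding.GetEncoding(
            "utf-8",
            new EncoderReplacementFallback(""),
            new DecoderReplacementFallback(""));

        string text = lenient.GetString(data);

        // The decoded string is well-formed, so re-encoding cannot fail.
        return Encoding.UTF8.GetBytes(text);
    }
}
```

For example, passing the bytes 0x41, 0xFF, 0x42 should give back 0x41, 0x42, because 0xFF can never appear in valid UTF-8. Note that this approach does not tell you where the invalid bytes were.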
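using System.Text;

static void CheckUTF8(byte[] data)
{
    // Second argument (throwOnInvalidBytes: true) makes the decoder throw on invalid input.
    new UTF8Encoding(false, true).GetCharCount(data);
}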
Throws a DecoderFallbackException on invalid data. DecoderFallbackException.Index should point to the index of the invalid sequence.
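If you want the position as a value rather than an exception, the call can be wrapped as in the sketch below. The names Utf8Checker and FirstInvalidIndex are invented for this example, and returning -1 for valid input is just a convention chosen here. It only finds the first invalid sequence; reporting every one would need a loop that resumes after each failure, or a hand-written scanner.

```csharp
using System.Text;

// Hypothetical helper (name invented for this sketch).
static class Utf8Checker
{
    // Returns the index reported for the first invalid sequence,
    // or -1 (a convention of this sketch) when the data is valid UTF-8.
    public static int FirstInvalidIndex(byte[] data)
    {
        // throwOnInvalidBytes: true selects the exception fallback.
        var strict = new UTF8Encoding(false, true);
        try
        {
            strict.GetCharCount(data);
            return -1;
        }
        catch (DecoderFallbackException ex)
        {
            // Index is the position in the input of the bytes that failed to decode.
            return ex.Index;
        }
    }
}
```

The exception's BytesUnknown property also holds the offending bytes, which can be useful in an error report.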