
Looking for samples to validate UTF-8

Suppose I have a byte stream (array), and I want to write code (in C# on .NET) to check whether it is a valid UTF-8 byte sequence. I want to write the code from scratch because I need to report the exact location of any invalid byte sequences, and possibly remove the invalid bytes, rather than just get a yes-or-no answer about whether the byte stream/array is valid.

Is there any sample code I can refer to? If there is no C# code, simple samples in C++/Java are also appreciated. Thanks!

By invalid UTF-8 byte sequences, I mean those described here:

http://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences

Thanks in advance, George

George2 asked Apr 17 '26 13:04


2 Answers

What you need is a DecoderFallback. When an Encoding decodes a sequence of bytes into characters, you can specify what happens when it meets bytes it cannot decode:

  • Report the error and stop processing (DecoderExceptionFallback).
  • Replace the invalid bytes and continue (DecoderReplacementFallback).

Using UTF8Encoding with a DecoderReplacementFallback, you can achieve just what you're looking for.
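As a rough illustration of the replacement option, here is a minimal sketch. The class and method names (Utf8Replace, DecodeLenient) are made up for this example, and U+FFFD is just one possible replacement string:

```csharp
using System;
using System.Text;

static class Utf8Replace
{
    // Hypothetical helper: decodes data as UTF-8, substituting U+FFFD
    // wherever the decoder meets bytes it cannot decode.
    public static string DecodeLenient(byte[] data)
    {
        Encoding utf8 = Encoding.GetEncoding(
            "utf-8",
            EncoderFallback.ExceptionFallback,          // encoder side is not used here
            new DecoderReplacementFallback("\uFFFD"));  // decoder side: replace and continue
        return utf8.GetString(data);
    }
}
```

For the bytes 0x41 0xFF 0x42 this should give "A", then U+FFFD, then "B". Passing an empty replacement string instead would drop the invalid bytes, but note that replacement alone does not tell you where they were.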

DreamSonic answered Apr 19 '26 01:04


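using System.Text;

// Throws DecoderFallbackException if data is not well-formed UTF-8.
static void CheckUTF8(byte[] data)
{
    // Arguments: encoderShouldEmitUTF8Identifier = false, throwOnInvalidBytes = true
    new UTF8Encoding(false, true).GetCharCount(data);
}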

This throws a DecoderFallbackException on invalid data. The exception's Index property should point to the position of the invalid sequence in the input, and its BytesUnknown property holds the bytes that could not be decoded.
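To turn that into a location report, one possible sketch is to catch the exception and return its Index. The names Utf8Check and FirstInvalidIndex are invented for illustration, and this only finds the first problem, since decoding stops at the first invalid sequence:

```csharp
using System;
using System.Text;

static class Utf8Check
{
    // Hypothetical helper: returns -1 if data decodes cleanly as UTF-8,
    // otherwise the Index reported by the DecoderFallbackException.
    public static int FirstInvalidIndex(byte[] data)
    {
        // throwOnInvalidBytes = true makes the decoder throw instead of replacing.
        var strict = new UTF8Encoding(false, true);
        try
        {
            strict.GetCharCount(data);
            return -1;
        }
        catch (DecoderFallbackException ex)
        {
            return ex.Index;
        }
    }
}
```

It is worth checking the reported index against a few known-bad inputs on your target runtime before relying on it for exact positions.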

Nuno Cruces answered Apr 19 '26 02:04


