
How to recognize if a string contains unicode chars?

I have a string and I want to know whether it contains any Unicode characters, that is, whether it consists entirely of ASCII characters or not.

How can I achieve that?

Thanks!

Himberjack asked Dec 16 '10 10:12




1 Answer

If my assumptions are correct, you want to know whether your string contains any "non-ANSI" characters, meaning characters with a code above 255. You can check this as follows.

    using System;
    using System.Linq;

    public class UnicodeCheck
    {
        public void test()
        {
            const string WithUnicodeCharacter = "a hebrew character:\uFB2F";
            const string WithoutUnicodeCharacter = "an ANSI character:Æ";

            bool hasUnicode;

            // true: U+FB2F is above 255
            hasUnicode = ContainsUnicodeCharacter(WithUnicodeCharacter);
            Console.WriteLine(hasUnicode);

            // false: Æ is U+00C6 (198), which is within 0..255
            hasUnicode = ContainsUnicodeCharacter(WithoutUnicodeCharacter);
            Console.WriteLine(hasUnicode);
        }

        // Returns true if any character in the input has a code above 255.
        public bool ContainsUnicodeCharacter(string input)
        {
            const int MaxAnsiCode = 255;

            return input.Any(c => c > MaxAnsiCode);
        }
    }

Update

This check treats extended ASCII (codes 128 to 255) as not Unicode. If you test only against the true ASCII range (up to 127), you could get false positives for extended ASCII characters, which do not necessarily denote Unicode. I have alluded to this in my sample, where Æ is reported as not containing Unicode.
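If the goal is the stricter test from the question (is the string pure ASCII or not), the same approach should work with a threshold of 127. Below is a minimal sketch under that assumption; the class `AsciiCheck` and the helper `ContainsNonAscii` are hypothetical names chosen for illustration, not part of the answer above or of .NET.

```csharp
using System;
using System.Linq;

public static class AsciiCheck
{
    // Highest code in 7-bit ASCII (U+007F).
    private const int MaxAsciiCode = 127;

    // Hypothetical helper: returns true if any UTF-16 code unit
    // in the input lies outside the 7-bit ASCII range.
    public static bool ContainsNonAscii(string input)
    {
        return input.Any(c => c > MaxAsciiCode);
    }

    public static void Main()
    {
        Console.WriteLine(ContainsNonAscii("plain text"));                // False
        Console.WriteLine(ContainsNonAscii("an ANSI character:Æ"));       // True
        Console.WriteLine(ContainsNonAscii("a hebrew character:\uFB2F")); // True
    }
}
```

With this threshold, Æ counts as non-ASCII, whereas the 255 threshold used in the answer lets it through. Which of the two is appropriate depends on whether extended characters should be accepted.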

Tim Lloyd answered Sep 26 '22 02:09