
codePointAt equivalent in C#

Tags: java, c#, unicode

I have this code in Java, and it works fine:

    String a = "ABC";
    System.out.println(a.length());
    for (int n = 0; n < a.length(); n++)
        System.out.println(a.codePointAt(n));

The output, as expected, is 3 65 66 67. I am a little confused about a.length(), because it is supposed to return the length in chars, but String must store every char below 256 in 16 bits, or whatever a Unicode character would need.

But the question is: how can I do the same in C#? I need to scan a string and act depending on some Unicode characters found.

The real code I need to translate is

    String str = this.getString();
    int cp;
    boolean escaping = false;
    for (int n = 0; n < len; n++)
    {
        //===================================================
        cp = str.codePointAt(n); //LOOKING FOR SOME EQUIVALENT IN C#
        //===================================================
        if (!escaping)
        {
          ....

       //Closing all braces below.

Thanks in advance.

How much I love Java :). I just need to deliver a Windows app that is a client of a Java/Linux app server.

mdev asked May 20 '14 04:05


1 Answer

The exact translation would be this:

    string a = "ABC⤶"; // Let's throw in a rarer Unicode char (U+2936)
    Console.WriteLine(a.Length);
    for (int n = 0; n < a.Length; n++)
        Console.WriteLine((int)a[n]); // a[n] returns a char, which we can cast to an int
    // Final result: 4 65 66 67 10550

In C# you don't need codePointAt at all for this: you can get the character's numeric value directly by casting the char to an int (in an assignment to an int variable, the conversion is implicit). So you can get your cp simply by doing

    cp = (int)str[n];

How much I love C# :)
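The same relationship can be checked from the Java side: for characters in the Basic Multilingual Plane, casting charAt(n) to int gives the same number as codePointAt(n), which is what the C# cast relies on. A minimal sketch, with the class and method names (BmpCastDemo, castValues, codePointValues) made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;

public class BmpCastDemo {
    // Hypothetical helper: the UTF-16 code unit at each index, cast to int.
    static List<Integer> castValues(String s) {
        List<Integer> out = new ArrayList<>();
        for (int n = 0; n < s.length(); n++) {
            out.add((int) s.charAt(n)); // same idea as (int)str[n] in C#
        }
        return out;
    }

    // Hypothetical helper: what the question's loop prints, index by index.
    static List<Integer> codePointValues(String s) {
        List<Integer> out = new ArrayList<>();
        for (int n = 0; n < s.length(); n++) {
            out.add(s.codePointAt(n));
        }
        return out;
    }

    public static void main(String[] args) {
        String a = "ABC\u2936"; // the same string as the C# example above
        System.out.println(castValues(a));      // [65, 66, 67, 10550]
        System.out.println(codePointValues(a)); // [65, 66, 67, 10550]
    }
}
```

Both lists agree here only because every character in the sample string fits in a single 16-bit char.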

However, this is valid only for characters in the Basic Multilingual Plane (U+0000 to U+FFFF). Characters above that range are stored as surrogate pairs, and indexing the string yields the two halves as separate chars, so they won't be printed as one value. If you really need to handle full UTF-32 code points, you can refer to this answer, which basically uses

    int cp = Char.ConvertToUtf32(a, n);

and advances the loop index by two instead of one whenever Char.IsSurrogatePair(a, n) is true (because that character is encoded as two chars).

Your translation would then become

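    // Requires: using System; using System.Linq;
    string a = "ABC\U0001F01C";
    Console.WriteLine(a.Count(x => !char.IsHighSurrogate(x))); // 4 characters, although a.Length is 5
    for (var i = 0; i < a.Length; i += char.IsSurrogatePair(a, i) ? 2 : 1)
        Console.WriteLine(char.ConvertToUtf32(a, i)); // 65 66 67 127004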

Please note the change from a.Length to a little bit of LINQ for the count, because a surrogate pair is counted as two chars. We simply count how many chars are not high surrogates to get the count of actual characters.
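The Java loop in the question has the same limitation, since stepping the index by one visits each half of a surrogate pair separately. For comparison, here is a sketch of a surrogate-aware scan on the Java side, using only standard String and Character methods. The names CodePointScan and codePoints are illustrative, not from the original code.

```java
import java.util.ArrayList;
import java.util.List;

public class CodePointScan {
    // Hypothetical helper: collect every code point, stepping over surrogate pairs.
    static List<Integer> codePoints(String s) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            out.add(cp);
            i += Character.charCount(cp); // 2 for a surrogate pair, otherwise 1
        }
        return out;
    }

    public static void main(String[] args) {
        // U+1F01C lies outside the BMP, so it occupies two chars in the string.
        String a = "ABC" + new String(Character.toChars(0x1F01C));
        System.out.println(a.length());                      // 5 UTF-16 code units
        System.out.println(a.codePointCount(0, a.length())); // 4 code points
        System.out.println(codePoints(a));                   // [65, 66, 67, 127004]
    }
}
```

Here Character.charCount plays the role of the IsSurrogatePair check in the C# loop, and codePointCount corresponds to the LINQ count.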

Pierre-Luc Pineault answered Nov 05 '22 01:11