
How to handle 32-bit Unicode characters in an NSString

I have an NSString containing a Unicode character above U+FFFF, such as MUSICAL SYMBOL G CLEF '𝄞' (U+1D11E). I can create the NSString and display it.

NSString *s = @"A\U0001d11eB";  // "A𝄞B"
NSLog(@"String = \"%@\"", s);

The log is correct and displays the 3 characters, so the NSString is well formed and there is no encoding problem.

    String = "A𝄞B"

But when I try to loop through all characters using the method

- (unichar)characterAtIndex:(NSUInteger)index

everything goes wrong.

The unichar type is 16 bits wide, so I expect to get the wrong character for the musical symbol. But the length of the string is also incorrect!

NSLog(@"Length = %d", [s length]);
for (int i=0; i<[s length]; i++)
{
    NSLog(@"  Character %d = %c", i, [s characterAtIndex:i]);
}

displays

    Length = 4
      Character 0 = A
      Character 1 = 4
      Character 2 = .
      Character 3 = B

What methods should I use to correctly parse my NSString and get my 3 Unicode characters? Ideally the right method would return a type like wchar_t instead of unichar.

Thank you

asked Dec 12 '13 by PatrickV


People also ask

How many characters can 32-bit Unicode store?

Unicode allows for 17 planes, each of 65,536 possible characters (or 'code points'). This gives a total of 1,114,112 possible characters.
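
A quick check of that arithmetic (a throwaway snippet added here, not from the original page):

// 17 planes of 65,536 code points each
NSLog(@"%d", 17 * 65536);                    // 1114112
NSLog(@"%#x", (unsigned)(17 * 65536 - 1));   // 0x10ffff, the highest valid code point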

How many bits does a Unicode character require?

Unicode text is commonly stored in either an 8-bit (UTF-8) or a 16-bit (UTF-16) encoding form, depending on the data being encoded. In the 16-bit form, most characters fit in a single 16-bit (2-byte) code unit, while characters above U+FFFF need two. A code point itself is usually written as U+hhhh, where hhhh is its hexadecimal value.
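
For example (a small sketch added here, not from the original page), NSString can report how many bytes a character needs in the 16-bit encoding form:

// 'A' fits in one 16-bit code unit; the G clef needs two (a surrogate pair)
NSLog(@"%lu", (unsigned long)[@"A" lengthOfBytesUsingEncoding:NSUTF16BigEndianStringEncoding]);            // 2
NSLog(@"%lu", (unsigned long)[@"\U0001d11e" lengthOfBytesUsingEncoding:NSUTF16BigEndianStringEncoding]);   // 4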

How many characters can you really store with 16-bit Unicode?

Unicode is a universal character set that aims to include all the characters needed for any writing system or language. The first code point positions fit in 16 bits and cover the most commonly used characters in a number of languages. This Basic Multilingual Plane allows for 65,536 characters.

How many bytes does it take to store a Unicode character?

In UTF-8, each character is encoded as 1 to 4 bytes; the first 128 Unicode code points (ASCII) are encoded as 1 byte.
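
The same check for UTF-8 (again an illustrative snippet, not from the original page):

// ASCII 'A' takes 1 byte in UTF-8; a code point above U+FFFF takes 4 bytes
NSLog(@"%lu", (unsigned long)[@"A" lengthOfBytesUsingEncoding:NSUTF8StringEncoding]);            // 1
NSLog(@"%lu", (unsigned long)[@"\U0001d11e" lengthOfBytesUsingEncoding:NSUTF8StringEncoding]);   // 4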


1 Answer

NSString *s = @"A\U0001d11eB";
// Convert to UTF-32: each Unicode character then occupies exactly 4 bytes
NSData *data = [s dataUsingEncoding:NSUTF32LittleEndianStringEncoding];
const wchar_t *wcs = [data bytes];
for (int i = 0; i < [data length]/4; i++) {
    NSLog(@"%#010x", wcs[i]);   // one code point per iteration
}

Output:

0x00000041
0x0001d11e
0x00000042

(The code assumes that wchar_t has a size of 4 bytes and little-endian encoding.)
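
If you prefer not to rely on the size of wchar_t, the same idea works with a fixed-width type (a variant sketch, not from the original answer; only the requested byte order remains an assumption):

NSString *s = @"A\U0001d11eB";
NSData *data = [s dataUsingEncoding:NSUTF32LittleEndianStringEncoding];
const uint32_t *codePoints = [data bytes];            // exactly 4 bytes per character
NSUInteger count = [data length] / sizeof(uint32_t);
for (NSUInteger i = 0; i < count; i++) {
    NSLog(@"U+%06X", codePoints[i]);                  // U+000041, U+01D11E, U+000042
}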

length and characterAtIndex: do not give the expected result because \U0001d11e is stored internally as a UTF-16 surrogate pair; length counts UTF-16 code units, not Unicode characters.
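
To make that concrete (an illustration added here, not part of the original answer), the string stores the G clef as the two 16-bit code units 0xD834 and 0xDD1E, and the standard UTF-16 formula recombines them:

NSString *s = @"A\U0001d11eB";            // [s length] == 4 UTF-16 code units
unichar high = [s characterAtIndex:1];    // 0xD834 (high surrogate)
unichar low  = [s characterAtIndex:2];    // 0xDD1E (low surrogate)
if (high >= 0xD800 && high <= 0xDBFF && low >= 0xDC00 && low <= 0xDFFF) {
    uint32_t codePoint = 0x10000 + ((uint32_t)(high - 0xD800) << 10) + (low - 0xDC00);
    NSLog(@"U+%X", codePoint);            // U+1D11E
}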

Another useful method for general Unicode strings is

[s enumerateSubstringsInRange:NSMakeRange(0, [s length])
              options:NSStringEnumerationByComposedCharacterSequences
           usingBlock:^(NSString *substring, NSRange substringRange, NSRange enclosingRange, BOOL *stop) {
    NSLog(@"%@", substring);
}];

Output:

A
𝄞
B
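
The same enumeration can be used, for example, to count the string's characters the way the question expects (a small follow-up sketch, not part of the original answer):

__block NSUInteger count = 0;
[s enumerateSubstringsInRange:NSMakeRange(0, [s length])
                      options:NSStringEnumerationByComposedCharacterSequences
                   usingBlock:^(NSString *substring, NSRange substringRange, NSRange enclosingRange, BOOL *stop) {
    count++;
}];
NSLog(@"Number of characters = %lu", (unsigned long)count);   // 3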

answered by Martin R