I have an NSString containing a Unicode character above U+FFFF, such as the MUSICAL SYMBOL G CLEF '𝄞'. I can create the NSString and display it.
NSString *s = @"A\U0001d11eB"; // "A𝄞B"
NSLog(@"String = \"%@\"", s);
The log is correct and displays the 3 characters, so the NSString is well formed and there is no encoding problem.
String = "A𝄞B"
But when I try to loop through all characters using the method
- (unichar)characterAtIndex:(NSUInteger)index
everything goes wrong.
The type unichar is only 16 bits, so I expect to get a wrong character for the musical symbol. But the length of the string is not what I expect either!
NSLog(@"Length = %d", [s length]);
for (int i=0; i<[s length]; i++)
{
NSLog(@" Character %d = %c", i, [s characterAtIndex:i]);
}
displays
Length = 4
Character 0 = A
Character 1 = 4
Character 2 = .
Character 3 = B
What methods should I use to correctly parse my NSString and get my 3 Unicode characters? Ideally the right method would return a type like wchar_t instead of unichar.
Thank you
Unicode allows for 17 planes, each of 65,536 possible characters (or 'code points'). This gives a total of 1,114,112 possible characters.
Unicode defines several encoding forms, including UTF-8, UTF-16, and UTF-32. NSString exposes its contents as UTF-16, where most characters fit in a single 16-bit (2-byte) code unit. A code point is usually written as U+hhhh, where hhhh is its hexadecimal value.
Unicode is a universal character set that aims to include all the characters needed for any writing system or language. The first 65,536 code points form the Basic Multilingual Plane (BMP), which fits in 16 bits and covers the most commonly used characters of many languages.
In UTF-8, each character is encoded as 1 to 4 bytes; the first 128 Unicode code points (the ASCII range) are encoded as a single byte.
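As an aside, you can make this variable-length encoding visible by dumping the raw UTF-8 bytes of the example string. This is a minimal sketch (variable names are mine); 'A' and 'B' come out as one byte each, and U+1D11E as four bytes:

NSString *s = @"A\U0001d11eB";
NSData *utf8 = [s dataUsingEncoding:NSUTF8StringEncoding];
const unsigned char *bytes = [utf8 bytes];
for (NSUInteger i = 0; i < [utf8 length]; i++) {
    // Each element is one raw UTF-8 byte of the string.
    NSLog(@"Byte %lu = 0x%02X", (unsigned long)i, (unsigned int)bytes[i]);
}
// Expected bytes: 0x41 ('A'), 0xF0 0x9D 0x84 0x9E (U+1D11E), 0x42 ('B')

The code below takes the complementary approach and converts the same string to UTF-32, so that each code point becomes one 32-bit value: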
NSString *s = @"A\U0001d11eB";
NSData *data = [s dataUsingEncoding:NSUTF32LittleEndianStringEncoding];
const wchar_t *wcs = [data bytes];
for (int i = 0; i < [data length]/4; i++) {
NSLog(@"%#010x", wcs[i]);
}
Output:
0x00000041 0x0001d11e 0x00000042
(The code assumes that wchar_t has a size of 4 bytes and little-endian byte order.)
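If you would rather not rely on the size of wchar_t, a fixed-width integer type can be used instead. Here is a minimal sketch of the same idea using uint32_t (the choice of type and names is mine; it still assumes a little-endian host, matching the encoding requested above):

#include <stdint.h>

NSString *s = @"A\U0001d11eB";
NSData *data = [s dataUsingEncoding:NSUTF32LittleEndianStringEncoding];
const uint32_t *codePoints = [data bytes]; // one element per Unicode code point
for (NSUInteger i = 0; i < [data length] / sizeof(uint32_t); i++) {
    NSLog(@"U+%06X", (unsigned int)codePoints[i]);
}
// Expected output: U+000041, U+01D11E, U+000042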
length and characterAtIndex: do not give the result you expect because \U0001d11e is stored internally as a UTF-16 surrogate pair.
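You can see the surrogate pair directly by logging each UTF-16 code unit in hex; a small sketch:

NSString *s = @"A\U0001d11eB";
for (NSUInteger i = 0; i < [s length]; i++) {
    // Each unichar is one UTF-16 code unit, not necessarily a whole character.
    NSLog(@"UTF-16 unit %lu = 0x%04X", (unsigned long)i, (unsigned int)[s characterAtIndex:i]);
}
// Expected: 0x0041, 0xD834, 0xDD1E, 0x0042
// 0xD834 0xDD1E is the surrogate pair that encodes U+1D11E, which is why the length is 4.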
Another useful method for general Unicode strings is
[s enumerateSubstringsInRange:NSMakeRange(0, [s length])
                      options:NSStringEnumerationByComposedCharacterSequences
                   usingBlock:^(NSString *substring, NSRange substringRange, NSRange enclosingRange, BOOL *stop) {
    NSLog(@"%@", substring);
}];
Output:
A 𝄞 B
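The same enumeration can also be used to count composed character sequences, which gives the 3 "characters" you expect even though [s length] is 4. A minimal sketch:

__block NSUInteger count = 0;
[s enumerateSubstringsInRange:NSMakeRange(0, [s length])
                      options:NSStringEnumerationByComposedCharacterSequences
                   usingBlock:^(NSString *substring, NSRange substringRange, NSRange enclosingRange, BOOL *stop) {
    count++; // one increment per user-perceived character
}];
NSLog(@"Number of composed characters = %lu", (unsigned long)count);
// Output: Number of composed characters = 3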