 

NSString from Unicode

I have a bunch of Unicode code points wrapped up in NSNumber objects, like so:

@(0x1f4de),    // EntypoIconTypePhone
@(0x1f4f1),    // EntypoIconTypeMobile
@(0xe789),     // EntypoIconTypeMouse
@(0xe723),     // EntypoIconTypeAddress
@(0x2709),     // EntypoIconTypeMail
@(0x1f53f),    // EntypoIconTypePaperPlane
@(0x270e),     // EntypoIconTypePencil

These are icons from the Entypo font (highly recommended).

This is the code I am using to create an NSString from the code point:

NSNumber *u = self.unicodeLookup[type];

int unicode = [u intValue];
UniChar chars[] = {unicode};

NSString *string = [[NSString alloc] initWithCharacters:chars length:sizeof(chars) / sizeof(UniChar)];

What I am finding is that some of these icons are created as expected, but not all of them; from what I can see, it is the code points with five hex digits that are not being created properly.

For example, these work:

@(0xe723),     // EntypoIconTypeAddress
@(0x2709),     // EntypoIconTypeMail

but these don't:

@(0x1f4de),    // EntypoIconTypePhone
@(0x1f4f1),    // EntypoIconTypeMobile

I'm pretty sure the problem is in my conversion code. I don't really understand all this encoding malarkey.

asked Oct 04 '22 by Lee Probert


1 Answer

If you store your character constants using unichar, rather than NSNumber objects, then the compiler itself will tell you the reason:

unichar chars[] = 
{
    0xe723,     // EntypoIconTypeAddress
    0x2709,     // EntypoIconTypeMail
    0x1f4de,    // EntypoIconTypePhone
    0x1f4f1     // EntypoIconTypeMobile
};

Implicit conversion from 'int' to 'unichar' (aka 'unsigned short') changes value from 128222 to 62686
Implicit conversion from 'int' to 'unichar' (aka 'unsigned short') changes value from 128241 to 62705

As iOS/OS X uses a 16-bit (UTF-16) representation of Unicode characters internally, and 0x1f4de and 0x1f4f1 both lie outside the Basic Multilingual Plane (they don't fit in 16 bits), you are going to need to encode those characters as surrogate pairs:

a = 0x1f4de - 0x10000 = 0xf4de
high = a >> 10 = 0x3d
low = a & 0x3ff = 0xde
w1 = high + 0xd800 = 0xd83d
w2 = low + 0xdc00 = 0xdcde

0x1f4de (UTF-32) = 0xd83d 0xdcde (UTF-16)

(See this Wikipedia page).
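
Expressed in code, the same arithmetic looks roughly like the sketch below (the helper name splitIntoSurrogatePair is purely illustrative, not a framework function):

#import <Foundation/Foundation.h>

// Minimal sketch: split a supplementary-plane code point (above U+FFFF)
// into its UTF-16 surrogate pair, mirroring the arithmetic above.
static void splitIntoSurrogatePair(uint32_t codePoint, unichar *high, unichar *low)
{
    uint32_t a = codePoint - 0x10000;          // 0x1f4de -> 0xf4de
    *high = (unichar)((a >> 10)  + 0xd800);    //         -> 0xd83d
    *low  = (unichar)((a & 0x3ff) + 0xdc00);   //         -> 0xdcde
}

// Usage: build the two-unit UTF-16 sequence and hand it to NSString.
unichar pair[2];
splitIntoSurrogatePair(0x1f4de, &pair[0], &pair[1]);
NSString *phone = [[NSString alloc] initWithCharacters:pair length:2];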

The upshot is that you cannot use a flat array with one unichar per icon; you have to know how many UTF-16 code units each character's encoding requires.
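
One way to sidestep the surrogate math entirely (a sketch, assuming the same self.unicodeLookup dictionary as in the question) is to hand NSString the code point as UTF-32 data and let Foundation do the conversion:

NSNumber *u = self.unicodeLookup[type];
uint32_t codePoint = (uint32_t)[u unsignedIntValue];   // e.g. 0x1f4de

// NSUTF32LittleEndianStringEncoding copes with both BMP and
// supplementary-plane characters, so no manual surrogate handling is needed.
NSString *string = [[NSString alloc] initWithBytes:&codePoint
                                            length:sizeof(codePoint)
                                          encoding:NSUTF32LittleEndianStringEncoding];

This should work for all of the icons above, including the five-hex-digit ones.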

answered Oct 09 '22 by trojanfoe