Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Converting an NSString to and from UTF32

I'm working with a database that includes hex codes for UTF32 characters. I would like to take these characters and store them in an NSString. I need to have routines to convert in both ways.

To convert the first character of an NSString to a unicode value, this routine seems to work:

const unsigned char *cs = (const unsigned char *)
    [s cStringUsingEncoding:NSUTF32StringEncoding];
uint32_t code = 0;
for ( int i = 3 ; i >= 0 ; i-- ) {
    code <<= 8;
    code += cs[i];
}
return code;

However, I am unable to do the reverse (i.e. take a single code and convert it into an NSString). I thought I could just do the reverse of what I do above by simply creating a c-string with the UTF32 character in it with the bytes in the correct order, and then create an NSString from that using the correct encoding.

However, converting to / from cstrings does not seem to be reversible for me.

For example, I've tried this code, and the "tmp" string is not equal to the original string "s".

char *cs = [s cStringUsingEncoding:NSUTF32StringEncoding];
NSString *tmp = [NSString stringWithCString:cs encoding:NSUTF32StringEncoding];

Does anyone know what I am doing wrong? Should I be using "wchar_t" for the cstring instead of char *?

Any help is greatly appreciated!

Thanks, Ron

like image 653
Ron Avatar asked Dec 05 '22 21:12

Ron


1 Answers

You have a couple of reasonable options.

1. Conversion

The first is to convert your UTF32 to UTF16 and use those with NSString, as UTF16 is the "native" encoding of NSString. It's not actually all that hard. If the UTF32 character is in the BMP (e.g. it's high two bytes are 0's), you can just cast it to unichar directly. If it's in any other plane, you can convert it to a surrogate pair of UTF16 characters. You can find the rules on the wikipedia page. But a quick (untested) conversion would look like

UTF32Char inputChar = // my UTF-32 character
inputChar -= 0x10000;
unichar highSurrogate = inputChar >> 10; // leave the top 10 bits
highSurrogate += 0xD800;
unichar lowSurrogate = inputChar & 0x3FF; // leave the low 10 bits
lowSurrogate += 0xDC00;

Now you can create an NSString using both characters at the same time:

NSString *str = [NSString stringWithCharacters:(unichar[]){highSurrogate, lowSurrogate} length:2];

To go backwards, you can use [NSString getCharacters:range:] to get the unichar's back and then reverse the surrogate pair algorithm to get your UTF32 character back (any characters which aren't in the range 0xD800-0xDFFF should just be cast to UTF32 directly).

2. Byte buffers

Your other option is to let NSString do the conversion directly without using cStrings. To convert a UTF32 value into an NSString you can use something like the following:

UTF32Char inputChar = // input UTF32 value
inputChar = NSSwapHostIntToLittle(inputChar); // swap to little-endian if necessary
NSString *str = [[[NSString alloc] initWithBytes:&inputChar length:4 encoding:NSUTF32LittleEndianStringEncoding] autorelease];

To get it back out again, you can use

UTF32Char outputChar;
if ([str getBytes:&outputChar maxLength:4 usedLength:NULL encoding:NSUTF32LittleEndianStringEncoding options:0 range:NSMakeRange(0, 1) remainingRange:NULL]) {
    outputChar = NSSwapLittleIntToHost(outputChar); // swap back to host endian
    // outputChar now has the first UTF32 character
}
like image 129
Lily Ballard Avatar answered Dec 07 '22 10:12

Lily Ballard