Special Characters in NSString from HTML

Question

I'm fetching data from an XML source and parsing it with TBXML. Everything works fine until I hit a Latin letter such as "é", which is displayed as:

&#233;

I don't see an NSString method that does the conversion. Any ideas?

johne · Accepted Answer

You can use a regex. A regex is a solution to, and cause of, all problems! :)

The example below uses, at least as of this writing, the unreleased RegexKitLite 4.0. You can get the 4.0 development snapshot via svn:

shell% svn co http://regexkit.svn.sourceforge.net/svnroot/regexkit regexkit

The examples below take advantage of the new 4.0 Blocks feature to do a search and replace of the &#233; character entities.

This first example is the "simpler" of the two. It only handles decimal character entities like &#233; and not hexadecimal character entities like &#xe9;. If you can guarantee that you'll never have hexadecimal character entities, this should be fine:

#import <Foundation/Foundation.h>
#import "RegexKitLite.h"

int main(int argc, char *charv[]) {
  NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];

  NSString *string = @"A test: &#233; and &#xe9; ? YAY! Even >0xffff are handled: &#119808; or &#x1D400;, see? (0x1d400 == MATHEMATICAL BOLD CAPITAL A)";
  NSString *regex = @"&#([0-9]+);";

  NSString *replacedString = [string stringByReplacingOccurrencesOfRegex:regex usingBlock:^NSString *(NSInteger captureCount, NSString * const capturedStrings[captureCount], const NSRange capturedRanges[captureCount], volatile BOOL * const stop) {
      NSUInteger u16Length = 0UL, u32_ch = [capturedStrings[1] integerValue];
      UniChar u16Buffer[3];

      if (u32_ch <= 0xFFFFU)       { u16Buffer[u16Length++] = ((u32_ch >= 0xD800U) && (u32_ch <= 0xDFFFU)) ? 0xFFFDU : u32_ch; }
      else if (u32_ch > 0x10FFFFU) { u16Buffer[u16Length++] = 0xFFFDU; }
      else                         { u32_ch -= 0x0010000UL; u16Buffer[u16Length++] = ((u32_ch >> 10) + 0xD800U); u16Buffer[u16Length++] = ((u32_ch & 0x3FFUL) + 0xDC00U); }

      return([NSString stringWithCharacters:u16Buffer length:u16Length]);
    }];

  NSLog(@"replaced: '%@'", replacedString);

  [pool release];

  return(0);
}

Compile and run with:

shell% gcc -arch i386 -g -o charReplace charReplace.m RegexKitLite.m -framework Foundation -licucore
shell% ./charReplace
2010-02-13 22:51:48.909 charReplace[35527:903] replaced: 'A test: é and &#xe9; ? YAY! Even >0xffff are handled: 𝐀 or &#x1D400;, see? (0x1d400 == MATHEMATICAL BOLD CAPITAL A)'

The 0x1d400 character might not show up in your browser, but it looks like a bold A in a terminal window.

The "three lines" in the middle of the replacement block ensure correct conversion of UTF-32 characters that are > 0xFFFF. I put this in for the sake of completeness and correctness. Invalid UTF-32 character values (0xd800 - 0xdfff, or anything above 0x10ffff) are turned into U+FFFD, or REPLACEMENT CHARACTER. If you can "guarantee" that you'll never have &#...; character entities that are > 0xFFFF (or 65535), and that they are always "legal" UTF-32, then you can remove those lines and simplify the whole block down to something like:

return([NSString stringWithFormat:@"%C", [capturedStrings[1] integerValue]]);
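For anyone who wants to see the "three lines" in isolation, here is a minimal sketch of the same arithmetic as a plain C function (it compiles as-is inside a .m file). The name encode_utf16 is made up for illustration and is not part of RegexKitLite or Foundation.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-16 code units.
 * Writes 1 or 2 units into out (which must hold at least 2) and returns
 * how many were written. Surrogate values (0xD800 - 0xDFFF) and anything
 * above 0x10FFFF are replaced with U+FFFD, REPLACEMENT CHARACTER.
 * encode_utf16 is a hypothetical helper name, not a library function. */
size_t encode_utf16(uint32_t cp, uint16_t out[2]) {
    if (cp <= 0xFFFFu) {
        /* Fits in a single unit, unless it is a lone surrogate value. */
        out[0] = (cp >= 0xD800u && cp <= 0xDFFFu) ? 0xFFFDu : (uint16_t)cp;
        return 1;
    }
    if (cp > 0x10FFFFu) {
        /* Beyond the Unicode range. */
        out[0] = 0xFFFDu;
        return 1;
    }
    /* Split the 20 remaining bits into a high and a low surrogate. */
    cp -= 0x10000u;
    out[0] = (uint16_t)(0xD800u + (cp >> 10));
    out[1] = (uint16_t)(0xDC00u + (cp & 0x3FFu));
    return 2;
}
```

The units it produces can be handed to [NSString stringWithCharacters:length:], the same way the Block above does with u16Buffer and u16Length.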

The second example does both decimal and hexadecimal character entities:

#import <Foundation/Foundation.h>
#import "RegexKitLite.h"

int main(int argc, char *charv[]) {
  NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];

  NSString *string = @"A test: &#233; and &#xe9; ? YAY! Even >0xffff are handled: &#119808; or &#x1D400;, see? (0x1d400 == MATHEMATICAL BOLD CAPITAL A)";
  NSString *regex = @"&#(?:([0-9]+)|x([0-9a-fA-F]+));";

  NSString *replacedString = [string stringByReplacingOccurrencesOfRegex:regex usingBlock:^NSString *(NSInteger captureCount, NSString * const capturedStrings[captureCount], const NSRange capturedRanges[captureCount], volatile BOOL * const stop) {
      NSUInteger u16Length = 0UL, u32_ch = 0UL;
      UniChar u16Buffer[3];

      CFStringRef cfSelf = (capturedRanges[1].location != NSNotFound) ? (CFStringRef)capturedStrings[1] : (CFStringRef)capturedStrings[2];
      UInt8 buffer[64];
      const char *cptr;

      if((cptr = CFStringGetCStringPtr(cfSelf, kCFStringEncodingMacRoman)) == NULL) {
        CFRange range     = CFRangeMake(0L, CFStringGetLength(cfSelf));
        CFIndex usedBytes = 0L;
        CFStringGetBytes(cfSelf, range, kCFStringEncodingUTF8, '?', false, buffer, 60L, &usedBytes);
        buffer[usedBytes] = 0;
        cptr              = (const char *)buffer;
      }

      u32_ch = strtoul(cptr, NULL, (capturedRanges[1].location != NSNotFound) ? 10 : 16);

      if (u32_ch <= 0xFFFFU)       { u16Buffer[u16Length++] = ((u32_ch >= 0xD800U) && (u32_ch <= 0xDFFFU)) ? 0xFFFDU : u32_ch; }
      else if (u32_ch > 0x10FFFFU) { u16Buffer[u16Length++] = 0xFFFDU; }
      else                         { u32_ch -= 0x0010000UL; u16Buffer[u16Length++] = ((u32_ch >> 10) + 0xD800U); u16Buffer[u16Length++] = ((u32_ch & 0x3FFUL) + 0xDC00U); }

      return([NSString stringWithCharacters:u16Buffer length:u16Length]);
    }];

  NSLog(@"replaced: '%@'", replacedString);

  [pool release];

  return(0);
}

Again, compile and run with:

shell% gcc -arch i386 -g -o charReplace charReplace.m RegexKitLite.m -framework Foundation -licucore
shell% ./charReplace
2010-02-13 22:52:02.182 charReplace[35540:903] replaced: 'A test: é and é ? YAY! Even >0xffff are handled: 𝐀 or 𝐀, see? (0x1d400 == MATHEMATICAL BOLD CAPITAL A)'

Note the difference in the output compared to the first: The first still had &#xe9; in it, and in this one it is replaced. Again, it's a tad longish, but I chose to go for completeness and correctness.
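The number parsing in the second example (base 10 when capture 1 matched, base 16 when capture 2 matched) can also be sketched in plain C. This is only an illustration under a few assumptions: parse_numeric_entity is a made-up name, it takes the text between "&#" and ";" as a C string instead of the two regex captures, and it also accepts an uppercase "X" prefix, which the regex above does not.

```c
#include <ctype.h>
#include <stdint.h>
#include <stdlib.h>

/* Parse the body of a numeric character entity, i.e. the text between
 * "&#" and ";", such as "233" or "xe9".
 * Returns 1 and stores the code point on success, 0 if the body is not a
 * well-formed decimal or hexadecimal number.
 * parse_numeric_entity is a hypothetical helper name. */
int parse_numeric_entity(const char *body, uint32_t *code_point) {
    int base = 10;
    if (body[0] == 'x' || body[0] == 'X') {
        base = 16;
        body++;
    }
    if (body[0] == '\0') {
        return 0;                        /* no digits at all */
    }
    /* Reject signs, whitespace and "0x" prefixes that strtoul would accept. */
    for (const char *p = body; *p != '\0'; p++) {
        unsigned char c = (unsigned char)*p;
        if (base == 10 ? !isdigit(c) : !isxdigit(c)) {
            return 0;
        }
    }
    unsigned long value = strtoul(body, NULL, base);
    if (value > 0x10FFFFul) {
        value = 0x110000ul;              /* clamp: caller maps this to U+FFFD */
    }
    *code_point = (uint32_t)value;
    return 1;
}
```

Out-of-range values are clamped to 0x110000 instead of being rejected, so the range check that follows in the Block still gets to turn them into U+FFFD.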

Both examples can have the stringByReplacingOccurrencesOfRegex: method replaced with the following for "extra speed", but you should refer to the documentation to see the caveats of using RKLRegexEnumerationFastCapturedStringsXXX. It's important to note that using it in the above is not a problem and perfectly safe (and one of the reasons why I added the option to RegexKitLite).

  NSString *replacedString = [string stringByReplacingOccurrencesOfRegex:regex options:RKLNoOptions inRange:NSMakeRange(0UL, [string length]) error:NULL enumerationOptions:RKLRegexEnumerationFastCapturedStringsXXX usingBlock:^NSString *(NSInteger captureCount, NSString * const capturedStrings[captureCount], const NSRange capturedRanges[captureCount], volatile BOOL * const stop) {

Another answer to your question pointed you to this Stack Overflow Question with an Answer. Differences between this solution and that solution (based on nothing more than a quick once over):

This solution:

Requires an external library (RegexKitLite).
Uses Blocks to perform its work, which are not available "everywhere" yet. There is Plausible Blocks, though, which lets you use Blocks on Mac OS X 10.5 and iPhone OS 2.2+ (I think). They backported the 10.6 gcc Blocks changes and made them available.

The other solution:

Uses standard Foundation classes, works everywhere.
A little less correct in handling some UTF-32 character code points (probably not an issue in practice).
Handles a couple of common named character entities like &gt;. This can be added easily to the above, though.

I haven't benchmarked either solution, but I'd be willing to bet large sums of money that the RegexKitLite solution using RKLRegexEnumerationFastCapturedStringsXXX beats the pants off the NSScanner solution.

And if you really wanted to add named character entities, you could change the regex to something like:

NSString *regex = @"&(?:#(?:([0-9]+)|x([0-9a-fA-F]+))|([a-zA-Z][a-zA-Z0-9]+));";

Note: I haven't tested the above at all.

Capture #3 should contain "the character entity name", which you can then use to do a look up. A really fancy way to do this would be to have an NSDictionary that uses the character entity name as the key and an NSString object containing the character that the name maps to as the value. You could even keep the whole thing as an external .plist resource and lazily load it on demand with something like:

NSDictionary *namedCharactersDictionary = [NSDictionary dictionaryWithContentsOfFile:@"namedCharacters.plist"];

You'd obviously tweak it to use NSBundle to get a path to your app's resource directory, but you get the idea. Then you'd add another condition check in the Block:

if(capturedRanges[3].location != NSNotFound) {
  NSString *namedCharacter = [namedCharactersDictionary objectForKey:capturedStrings[3]];
  return((namedCharacter == nil) ? capturedStrings[0] : namedCharacter);
}

If the named character is in the dictionary, it will replace it. Otherwise it returns the full &notfound; matched text (i.e., "does nothing").
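If you'd rather not ship a .plist, the same look up can be sketched as a small static table in plain C. Everything here is illustrative: lookup_named_entity and NAMED_ENTITIES are made-up names, and the table holds only a handful of entities, nowhere near the full HTML set.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One row of the (hypothetical) entity table: name without "&" and ";". */
struct named_entity {
    const char *name;
    uint32_t    code_point;
};

/* A tiny illustrative subset; a real table would be much longer. */
static const struct named_entity NAMED_ENTITIES[] = {
    { "amp",    0x26 },
    { "apos",   0x27 },
    { "eacute", 0xE9 },
    { "gt",     0x3E },
    { "lt",     0x3C },
    { "nbsp",   0xA0 },
    { "quot",   0x22 },
};

/* Look up an entity name (case-sensitive, since "Eacute" and "eacute"
 * are different entities). Returns 1 and stores the code point if the
 * name is known, 0 otherwise so the caller can leave the text alone. */
int lookup_named_entity(const char *name, uint32_t *code_point) {
    size_t count = sizeof NAMED_ENTITIES / sizeof NAMED_ENTITIES[0];
    for (size_t i = 0; i < count; i++) {
        if (strcmp(NAMED_ENTITIES[i].name, name) == 0) {
            *code_point = NAMED_ENTITIES[i].code_point;
            return 1;
        }
    }
    return 0;
}
```

A return value of 0 corresponds to the "does nothing" case above, where the Block hands back capturedStrings[0] unchanged. A linear scan is fine for a table this small; with the full entity list, a sorted table and bsearch (or the NSDictionary approach) would be the more sensible choice.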

BlueVoid · Answer

This seems like a pretty common problem. Check out HTML character decoding in Objective-C / Cocoa Touch

Tags: xml, iphone, nsstring

adamweeks
