First of all, I found this: Objective C HTML escape/unescape, but it doesn't work for me.
My encoded characters (come from a RSS feed, btw) look like this: &
I searched all over the net and found related discussions, but no fix for my particular encoding, I think they are called hexadecimal characters.
Check out my NSString category for HTML. Here are the methods available:
- (NSString *)stringByConvertingHTMLToPlainText;
- (NSString *)stringByDecodingHTMLEntities;
- (NSString *)stringByEncodingHTMLEntities;
- (NSString *)stringWithNewLinesAsBRs;
- (NSString *)stringByRemovingNewLinesAndWhitespace;
The one by Daniel is basically very nice, and I fixed a few issues there:
removed the skipping character for NSSCanner (otherwise spaces between two continuous entities would be ignored
[scanner setCharactersToBeSkipped:nil];
fixed the parsing when there are isolated '&' symbols (I am not sure what is the 'correct' output for this, I just compared it against firefox):
e.g.
&#ABC DF & B' & C' Items (288)
here is the modified code:
- (NSString *)stringByDecodingXMLEntities {
NSUInteger myLength = [self length];
NSUInteger ampIndex = [self rangeOfString:@"&" options:NSLiteralSearch].location;
// Short-circuit if there are no ampersands.
if (ampIndex == NSNotFound) {
return self;
}
// Make result string with some extra capacity.
NSMutableString *result = [NSMutableString stringWithCapacity:(myLength * 1.25)];
// First iteration doesn't need to scan to & since we did that already, but for code simplicity's sake we'll do it again with the scanner.
NSScanner *scanner = [NSScanner scannerWithString:self];
[scanner setCharactersToBeSkipped:nil];
NSCharacterSet *boundaryCharacterSet = [NSCharacterSet characterSetWithCharactersInString:@" \t\n\r;"];
do {
// Scan up to the next entity or the end of the string.
NSString *nonEntityString;
if ([scanner scanUpToString:@"&" intoString:&nonEntityString]) {
[result appendString:nonEntityString];
}
if ([scanner isAtEnd]) {
goto finish;
}
// Scan either a HTML or numeric character entity reference.
if ([scanner scanString:@"&" intoString:NULL])
[result appendString:@"&"];
else if ([scanner scanString:@"'" intoString:NULL])
[result appendString:@"'"];
else if ([scanner scanString:@""" intoString:NULL])
[result appendString:@"\""];
else if ([scanner scanString:@"<" intoString:NULL])
[result appendString:@"<"];
else if ([scanner scanString:@">" intoString:NULL])
[result appendString:@">"];
else if ([scanner scanString:@"&#" intoString:NULL]) {
BOOL gotNumber;
unsigned charCode;
NSString *xForHex = @"";
// Is it hex or decimal?
if ([scanner scanString:@"x" intoString:&xForHex]) {
gotNumber = [scanner scanHexInt:&charCode];
}
else {
gotNumber = [scanner scanInt:(int*)&charCode];
}
if (gotNumber) {
[result appendFormat:@"%C", (unichar)charCode];
[scanner scanString:@";" intoString:NULL];
}
else {
NSString *unknownEntity = @"";
[scanner scanUpToCharactersFromSet:boundaryCharacterSet intoString:&unknownEntity];
[result appendFormat:@"&#%@%@", xForHex, unknownEntity];
//[scanner scanUpToString:@";" intoString:&unknownEntity];
//[result appendFormat:@"&#%@%@;", xForHex, unknownEntity];
NSLog(@"Expected numeric character entity but got &#%@%@;", xForHex, unknownEntity);
}
}
else {
NSString *amp;
[scanner scanString:@"&" intoString:&]; //an isolated & symbol
[result appendString:amp];
/*
NSString *unknownEntity = @"";
[scanner scanUpToString:@";" intoString:&unknownEntity];
NSString *semicolon = @"";
[scanner scanString:@";" intoString:&semicolon];
[result appendFormat:@"%@%@", unknownEntity, semicolon];
NSLog(@"Unsupported XML character entity %@%@", unknownEntity, semicolon);
*/
}
}
while (![scanner isAtEnd]);
finish:
return result;
}
As of iOS 7, you can decode HTML characters natively by using an NSAttributedString
with the NSHTMLTextDocumentType
attribute:
NSString *htmlString = @" & & < > ™ © ♥ ♣ ♠ ♦";
NSData *stringData = [htmlString dataUsingEncoding:NSUTF8StringEncoding];
NSDictionary *options = @{NSDocumentTypeDocumentAttribute:NSHTMLTextDocumentType};
NSAttributedString *decodedString;
decodedString = [[NSAttributedString alloc] initWithData:stringData
options:options
documentAttributes:NULL
error:NULL];
The decoded attributed string will now be displayed as: & & < > ™ © ♥ ♣ ♠ ♦.
Note: This will only work if called on the main thread.
Those are called Character Entity References. When they take the form of &#<number>;
they are called numeric entity references. Basically, it's a string representation of the byte that should be substituted. In the case of &
, it represents the character with the value of 38 in the ISO-8859-1 character encoding scheme, which is &
.
The reason the ampersand has to be encoded in RSS is it's a reserved special character.
What you need to do is parse the string and replace the entities with a byte matching the value between &#
and ;
. I don't know of any great ways to do this in objective C, but this stack overflow question might be of some help.
Edit: Since answering this some two years ago there are some great solutions; see @Michael Waterfall's answer below.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With