
HTML character decoding in Objective-C / Cocoa Touch

First of all, I found this: Objective C HTML escape/unescape, but it doesn't work for me.

My encoded characters (they come from an RSS feed, by the way) look like this: &#038;

I searched all over the net and found related discussions, but no fix for my particular encoding. I think they are called hexadecimal characters.

treznik asked Jul 09 '09 16:07


4 Answers

Check out my NSString category for HTML. Here are the methods available:

- (NSString *)stringByConvertingHTMLToPlainText;
- (NSString *)stringByDecodingHTMLEntities;
- (NSString *)stringByEncodingHTMLEntities;
- (NSString *)stringWithNewLinesAsBRs;
- (NSString *)stringByRemovingNewLinesAndWhitespace;

Michael Waterfall answered Oct 21 '22 01:10


The one by Daniel is basically very nice, and I fixed a few issues there:

  1. removed the skipped characters for NSScanner (otherwise spaces between two consecutive entities would be ignored):

    [scanner setCharactersToBeSkipped:nil];

  2. fixed the parsing when there are isolated '&' symbols (I am not sure what the 'correct' output is for this; I just compared it against Firefox):

e.g.

    &#ABC DF & B'  & C' Items (288)

Here is the modified code:

- (NSString *)stringByDecodingXMLEntities {
    NSUInteger myLength = [self length];
    NSUInteger ampIndex = [self rangeOfString:@"&" options:NSLiteralSearch].location;

    // Short-circuit if there are no ampersands.
    if (ampIndex == NSNotFound) {
        return self;
    }
    // Make result string with some extra capacity.
    NSMutableString *result = [NSMutableString stringWithCapacity:(myLength * 1.25)];

    // First iteration doesn't need to scan to & since we did that already, but for code simplicity's sake we'll do it again with the scanner.
    NSScanner *scanner = [NSScanner scannerWithString:self];

    [scanner setCharactersToBeSkipped:nil];

    NSCharacterSet *boundaryCharacterSet = [NSCharacterSet characterSetWithCharactersInString:@" \t\n\r;"];

    do {
        // Scan up to the next entity or the end of the string.
        NSString *nonEntityString;
        if ([scanner scanUpToString:@"&" intoString:&nonEntityString]) {
            [result appendString:nonEntityString];
        }
        if ([scanner isAtEnd]) {
            goto finish;
        }
        // Scan either a HTML or numeric character entity reference.
        if ([scanner scanString:@"&amp;" intoString:NULL])
            [result appendString:@"&"];
        else if ([scanner scanString:@"&apos;" intoString:NULL])
            [result appendString:@"'"];
        else if ([scanner scanString:@"&quot;" intoString:NULL])
            [result appendString:@"\""];
        else if ([scanner scanString:@"&lt;" intoString:NULL])
            [result appendString:@"<"];
        else if ([scanner scanString:@"&gt;" intoString:NULL])
            [result appendString:@">"];
        else if ([scanner scanString:@"&#" intoString:NULL]) {
            BOOL gotNumber;
            unsigned charCode;
            NSString *xForHex = @"";

            // Is it hex or decimal?
            if ([scanner scanString:@"x" intoString:&xForHex]) {
                gotNumber = [scanner scanHexInt:&charCode];
            }
            else {
                gotNumber = [scanner scanInt:(int*)&charCode];
            }

            if (gotNumber) {
                [result appendFormat:@"%C", (unichar)charCode];

                [scanner scanString:@";" intoString:NULL];
            }
            else {
                NSString *unknownEntity = @"";

                [scanner scanUpToCharactersFromSet:boundaryCharacterSet intoString:&unknownEntity];


                [result appendFormat:@"&#%@%@", xForHex, unknownEntity];

                //[scanner scanUpToString:@";" intoString:&unknownEntity];
                //[result appendFormat:@"&#%@%@;", xForHex, unknownEntity];
                NSLog(@"Expected numeric character entity but got &#%@%@;", xForHex, unknownEntity);

            }

        }
        else {
            NSString *amp;

            [scanner scanString:@"&" intoString:&amp];  //an isolated & symbol
            [result appendString:amp];

            /*
            NSString *unknownEntity = @"";
            [scanner scanUpToString:@";" intoString:&unknownEntity];
            NSString *semicolon = @"";
            [scanner scanString:@";" intoString:&semicolon];
            [result appendFormat:@"%@%@", unknownEntity, semicolon];
            NSLog(@"Unsupported XML character entity %@%@", unknownEntity, semicolon);
             */
        }

    }
    while (![scanner isAtEnd]);

finish:
    return result;
}
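
One caveat with the `[result appendFormat:@"%C", (unichar)charCode]` line above: `unichar` is a 16-bit type, so a reference to a code point above U+FFFF (for example `&#x1F600;`) is truncated by the cast. A minimal sketch of one way around that, assuming a hypothetical helper named `StringFromCodePoint` (it is not part of the code above), is to build the UTF-16 surrogate pair by hand:

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper, not part of the category above.
// Returns a one-character string for a Unicode code point,
// or nil if the value is not a valid scalar value.
static NSString *StringFromCodePoint(uint32_t codePoint) {
    // Reject values beyond the Unicode range and lone surrogates.
    if (codePoint > 0x10FFFF || (codePoint >= 0xD800 && codePoint <= 0xDFFF)) {
        return nil;
    }
    // Code points up to U+FFFF fit in a single UTF-16 unit.
    if (codePoint <= 0xFFFF) {
        unichar unit = (unichar)codePoint;
        return [NSString stringWithCharacters:&unit length:1];
    }
    // Everything else needs a high/low surrogate pair.
    uint32_t offset = codePoint - 0x10000;
    unichar pair[2] = {
        (unichar)(0xD800 + (offset >> 10)),
        (unichar)(0xDC00 + (offset & 0x3FF))
    };
    return [NSString stringWithCharacters:pair length:2];
}
```

With something like this in place, the numeric branch could append `StringFromCodePoint(charCode)` when it is non-nil and fall back to the raw text otherwise. Whether invalid references are dropped or kept verbatim is a design choice.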

Walty Yeung answered Oct 21 '22 02:10


As of iOS 7, you can decode HTML characters natively by using an NSAttributedString with the NSHTMLTextDocumentType attribute:

NSString *htmlString = @"&#63743; &amp; &#38; &lt; &gt; &trade; &copy; &hearts; &clubs; &spades; &diams;";
NSData *stringData = [htmlString dataUsingEncoding:NSUTF8StringEncoding];

NSDictionary *options = @{NSDocumentTypeDocumentAttribute:NSHTMLTextDocumentType};
NSAttributedString *decodedString;
decodedString = [[NSAttributedString alloc] initWithData:stringData
                                                 options:options
                                      documentAttributes:NULL
                                                   error:NULL];

The decoded attributed string will now be displayed as:  & & < > ™ © ♥ ♣ ♠ ♦.

Note: This will only work if called on the main thread.
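
If the decoding might be triggered from a background queue, one option is to hop to the main queue first. The following is a rough sketch, assuming iOS (UIKit) and a hypothetical helper named `DecodeHTMLOnMainThread`. The `NSCharacterEncodingDocumentAttribute` option is an addition here, intended to make the UTF-8 input encoding explicit:

```objc
#import <UIKit/UIKit.h>

// Hypothetical helper: decodes HTML entities via NSAttributedString and
// always performs the work on the main thread. The completion block
// receives the plain string, or nil if the import failed.
static void DecodeHTMLOnMainThread(NSString *html,
                                   void (^completion)(NSString *decoded)) {
    void (^work)(void) = ^{
        NSData *data = [html dataUsingEncoding:NSUTF8StringEncoding];
        NSDictionary *options = @{
            NSDocumentTypeDocumentAttribute: NSHTMLTextDocumentType,
            NSCharacterEncodingDocumentAttribute: @(NSUTF8StringEncoding)
        };
        NSError *error = nil;
        NSAttributedString *attributed =
            [[NSAttributedString alloc] initWithData:data
                                             options:options
                                  documentAttributes:NULL
                                               error:&error];
        completion(attributed ? attributed.string : nil);
    };

    if ([NSThread isMainThread]) {
        work();  // already on the main thread, run synchronously
    } else {
        dispatch_async(dispatch_get_main_queue(), work);
    }
}
```

Note that when called off the main thread the completion block runs later, on the main queue, so callers should not expect the result to be available immediately.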

Bryan Luby answered Oct 21 '22 02:10


Those are called character entity references. When they take the form of &#<number>; they are called numeric character references. Basically, it's a string representation of the code point of the character that should be substituted. In the case of &#038;, it represents the character with the value 38, which is & (in Unicode as well as in ASCII and ISO-8859-1).

The reason the ampersand has to be encoded in RSS is that it's a reserved special character.

What you need to do is parse the string and replace each entity with the character matching the value between &# and ;. I don't know of any great ways to do this in Objective-C, but this Stack Overflow question might be of some help.
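
To make the "value between &# and ;" step concrete, here is a minimal sketch that turns the body of a numeric reference into a code point. The helper name `CodePointFromReferenceBody` is hypothetical, and the sketch assumes the caller has already stripped the leading `&#` and the trailing `;`:

```objc
#import <Foundation/Foundation.h>
#include <stdlib.h>

// Hypothetical helper: `body` is the text between "&#" and ";",
// e.g. @"038" (decimal) or @"x26" (hexadecimal).
// Returns YES and stores the code point on success.
static BOOL CodePointFromReferenceBody(NSString *body, uint32_t *outCodePoint) {
    // A leading x or X marks a hexadecimal reference.
    BOOL isHex = [body hasPrefix:@"x"] || [body hasPrefix:@"X"];
    NSString *digits = isHex ? [body substringFromIndex:1] : body;
    if (digits.length == 0) {
        return NO;
    }
    // Accept only digits that are valid for the chosen base.
    NSCharacterSet *allowed = [NSCharacterSet characterSetWithCharactersInString:
        (isHex ? @"0123456789abcdefABCDEF" : @"0123456789")];
    if ([digits rangeOfCharacterFromSet:allowed.invertedSet].location != NSNotFound) {
        return NO;
    }
    unsigned long value = strtoul(digits.UTF8String, NULL, isHex ? 16 : 10);
    if (value > 0x10FFFF) {
        return NO;  // beyond the Unicode range (also catches overflow)
    }
    *outCodePoint = (uint32_t)value;
    return YES;
}
```

Both `038` and `x26` yield 38 here, which is the code point of the ampersand.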

Edit: Since I answered this some two years ago, some great solutions have appeared; see @Michael Waterfall's answer.

Matt Bridges answered Oct 21 '22 01:10