Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

NSAttributedString initWithHTML incorrect character encoding?

Tags:

-[NSMutableAttributedString initWithHTML:documentAttributes:] seems to mangle special characters:

NSString *html = @"“Hello” World"; // notice the smart quotes
NSData *htmlData = [html dataUsingEncoding:NSUTF8StringEncoding];
NSMutableAttributedString *as = [[NSMutableAttributedString alloc] initWithHTML:htmlData documentAttributes:nil];
NSLog(@"%@", as);

That prints “Hello†World followed by some RTF commands. In my application, I convert the attributed string to RTF and display it in an NSTextView, but the characters are corrupted there, too.

According to the documentation, the default encoding is UTF-8, but I tried being explicit and the result is the same:

NSDictionary *attributes = @{NSCharacterEncodingDocumentAttribute: [NSNumber numberWithInt:NSUTF8StringEncoding]};
NSMutableAttributedString *as = [[NSMutableAttributedString alloc] initWithHTML:htmlData documentAttributes:&attributes];
like image 698
alltom Avatar asked Apr 11 '13 18:04

alltom


3 Answers

Use [html dataUsingEncoding:NSUnicodeStringEncoding] when creating the NSData and set the matching encoding option when you parse the HTML into an attributed string:

The documentation for NSCharacterEncodingDocumentAttribute is slightly confusing:

NSNumber, containing an int specifying the NSStringEncoding for the file; for reading and writing plain text files and writing HTML; default for plain text is the default encoding; default for HTML is UTF-8.

So, you code should be:

NSString *html = @"“Hello” World";
NSData *htmlData = [html dataUsingEncoding:NSUTF8StringEncoding];
NSDictionary *options = @{NSDocumentTypeDocumentAttribute: NSHTMLTextDocumentType,
                                    NSCharacterEncodingDocumentAttribute: @(NSUTF8StringEncoding)};
NSMutableAttributedString *as =
    [[NSMutableAttributedString alloc] initWithHTML:htmlData
                                            options: options
                                 documentAttributes:nil];
like image 135
alltom Avatar answered Oct 02 '22 14:10

alltom


The previous answer here works, but mostly by accident.

Making an NSData with NSUnicodeStringEncoding will tend to work, because that constant is an alias for NSUTF16StringEncoding, and UTF-16 is pretty easy for the system to identify. Easier than UTF-8, which apparently was being identified as some other superset of ASCII (it looks like NSWindowsCP1252StringEncoding in your case, probably because it's one of the few ASCII-based encodings with mappings for 0x8_ and 0x9_).

That answer is mistaken in quoting the documentation for NSCharacterEncodingDocumentAttribute, because "attributes" are what you get out of -initWithHTML. That's why it's NSDictionary ** and not just NSDictionary *. You can pass in a pointer to an NSDictionary *, and you'll get out keys like TopMargin/BottomMargin/LeftMargin/RightMargin, PaperSize, DocumentType, UTI, etc. Any values you try to pass in through the "attributes" dictionary are ignored.

You need to use "options" for passing values in, and the relevant option key is NSTextEncodingNameDocumentOption, which has no documented default value. It's passing the bytes to WebKit for parsing, so if you don't specify an encoding, presumably you're getting WebKit's encoding-guessing heuristics.

To guarantee the encoding types match between your NSData and NSAttributedString, what you should do is something like:

NSString *html = @"“Hello” World";
NSData *htmlData = [html dataUsingEncoding:NSUTF8StringEncoding];

NSMutableAttributedString *as =
    [[NSMutableAttributedString alloc] initWithHTML:htmlData
                                            options:@{NSTextEncodingNameDocumentOption: @"UTF-8"}
                                 documentAttributes:nil];
like image 28
user3277676 Avatar answered Oct 02 '22 14:10

user3277676


Swift version of accepted answer is:

let htmlString: String = "Hello world contains html</br>"
let data: Data = Data(htmlString.utf8)

let options: [NSAttributedString.DocumentReadingOptionKey: Any] = [
    .documentType: NSAttributedString.DocumentType.html,
    .characterEncoding: String.Encoding.utf8.rawValue
]

let attributedString = try? NSAttributedString(data: data,
    options: options,
    documentAttributes: nil)
like image 28
mathema Avatar answered Oct 02 '22 14:10

mathema