-[NSMutableAttributedString initWithHTML:documentAttributes:] seems to mangle special characters:
NSString *html = @"“Hello” World"; // notice the smart quotes
NSData *htmlData = [html dataUsingEncoding:NSUTF8StringEncoding];
NSMutableAttributedString *as = [[NSMutableAttributedString alloc] initWithHTML:htmlData documentAttributes:nil];
NSLog(@"%@", as);
That prints “Hello†World followed by some RTF commands. In my application, I convert the attributed string to RTF and display it in an NSTextView, but the characters are corrupted there, too.
According to the documentation, the default encoding is UTF-8, but I tried being explicit and the result is the same:
NSDictionary *attributes = @{NSCharacterEncodingDocumentAttribute: [NSNumber numberWithInt:NSUTF8StringEncoding]};
NSMutableAttributedString *as = [[NSMutableAttributedString alloc] initWithHTML:htmlData documentAttributes:&attributes];
Use [html dataUsingEncoding:NSUnicodeStringEncoding] when creating the NSData and set the matching encoding option when you parse the HTML into an attributed string:
The documentation for NSCharacterEncodingDocumentAttribute is slightly confusing:
NSNumber, containing an int specifying the
NSStringEncodingfor the file; for reading and writing plain text files and writing HTML; default for plain text is the default encoding; default for HTML is UTF-8.
So, you code should be:
NSString *html = @"“Hello” World";
NSData *htmlData = [html dataUsingEncoding:NSUTF8StringEncoding];
NSDictionary *options = @{NSDocumentTypeDocumentAttribute: NSHTMLTextDocumentType,
NSCharacterEncodingDocumentAttribute: @(NSUTF8StringEncoding)};
NSMutableAttributedString *as =
[[NSMutableAttributedString alloc] initWithHTML:htmlData
options: options
documentAttributes:nil];
The previous answer here works, but mostly by accident.
Making an NSData with NSUnicodeStringEncoding will tend to work, because that constant is an alias for NSUTF16StringEncoding, and UTF-16 is pretty easy for the system to identify. Easier than UTF-8, which apparently was being identified as some other superset of ASCII (it looks like NSWindowsCP1252StringEncoding in your case, probably because it's one of the few ASCII-based encodings with mappings for 0x8_ and 0x9_).
That answer is mistaken in quoting the documentation for NSCharacterEncodingDocumentAttribute, because "attributes" are what you get out of -initWithHTML. That's why it's NSDictionary ** and not just NSDictionary *. You can pass in a pointer to an NSDictionary *, and you'll get out keys like TopMargin/BottomMargin/LeftMargin/RightMargin, PaperSize, DocumentType, UTI, etc. Any values you try to pass in through the "attributes" dictionary are ignored.
You need to use "options" for passing values in, and the relevant option key is NSTextEncodingNameDocumentOption, which has no documented default value. It's passing the bytes to WebKit for parsing, so if you don't specify an encoding, presumably you're getting WebKit's encoding-guessing heuristics.
To guarantee the encoding types match between your NSData and NSAttributedString, what you should do is something like:
NSString *html = @"“Hello” World";
NSData *htmlData = [html dataUsingEncoding:NSUTF8StringEncoding];
NSMutableAttributedString *as =
[[NSMutableAttributedString alloc] initWithHTML:htmlData
options:@{NSTextEncodingNameDocumentOption: @"UTF-8"}
documentAttributes:nil];
Swift version of accepted answer is:
let htmlString: String = "Hello world contains html</br>"
let data: Data = Data(htmlString.utf8)
let options: [NSAttributedString.DocumentReadingOptionKey: Any] = [
.documentType: NSAttributedString.DocumentType.html,
.characterEncoding: String.Encoding.utf8.rawValue
]
let attributedString = try? NSAttributedString(data: data,
options: options,
documentAttributes: nil)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With