I am pulling data from a website via NSURLConnection and stashing the received data away in an instance of NSMutableData. In the connectionDidFinishLoading delegate method, the data is converted into a string with a call to the appropriate NSString method:
NSString *result = [[NSString alloc] initWithData:data encoding:NSUTF8StringEncoding];
The resulting string turns out to be nil. If I use NSASCIIStringEncoding, however, I do obtain the appropriate string, albeit with Unicode characters garbled, as expected. The server's Content-Type header does not specify the UTF-8 encoding, but I have tried a number of other websites with a similar setup, and there the string conversion works just fine. The problem seems to affect only this particular web service, and I have no clue why.
On a side note, is pulling web pages and data from an API good practice, i.e. buffering the data, converting into a string, and manipulating the string afterwards?
Much appreciated!
You say that it “is definitely UTF-8”, but without a Content-Type header, you don't really know that. (And even if you did have a header saying that, it could still be wrong.)
My guess is that your data is usually ASCII, which always parses correctly as UTF-8, but you sometimes are trying to parse data that's actually encoded in ISO 8859-1 or Windows codepage 1252. Such data will generally be mostly ASCII, but with some bytes outside the 0–127 range ASCII defines. UTF-8 would expect such bytes to form a sequence of code units within a specified sequence of ranges, but in other encodings, any byte, regardless of value, is a complete character on its own. Trying to interpret non-ASCII non-UTF-8 data as UTF-8 will almost always get you either wrong results (wrong characters) or no results at all (cannot decode; decoder returns nil), because the data was never encoded in UTF-8 in the first place.
You should try UTF-8 first, and if it fails, use ISO 8859-1. If you're letting the user retrieve any web page, you should let them change the encoding you use to decode the data, in case they discover that it was actually 8859-9 or codepage-1252 or some other 8-bit encoding.
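To make the "UTF-8 first, ISO 8859-1 second" idea concrete, here is a minimal sketch in plain C (callable from Objective-C). The function names `is_valid_utf8`, `latin1_to_utf8` and `decode_utf8_or_latin1` are made up for this illustration and are not part of Foundation or any library. The sketch checks whether the whole buffer is well-formed UTF-8 and, if it is not, reinterprets every byte as an ISO 8859-1 character and re-encodes it as UTF-8.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: true if buf[0..len) is well-formed UTF-8
 * (rejects truncated, overlong, surrogate and out-of-range sequences). */
bool is_valid_utf8(const unsigned char *buf, size_t len) {
    size_t i = 0;
    while (i < len) {
        unsigned char b = buf[i];
        size_t n;           /* number of continuation bytes expected */
        unsigned long min;  /* smallest code point allowed for this length */
        unsigned long cp;   /* code point being assembled */
        if (b < 0x80) { i++; continue; }                /* plain ASCII */
        else if ((b & 0xE0) == 0xC0) { n = 1; min = 0x80;    cp = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { n = 2; min = 0x800;   cp = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { n = 3; min = 0x10000; cp = b & 0x07; }
        else return false;                              /* invalid lead byte */
        if (len - i <= n) return false;                 /* truncated sequence */
        for (size_t k = 1; k <= n; k++) {
            unsigned char c = buf[i + k];
            if ((c & 0xC0) != 0x80) return false;       /* not a continuation */
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;
        i += n + 1;
    }
    return true;
}

/* Hypothetical helper: re-encodes ISO 8859-1 bytes as a NUL-terminated
 * UTF-8 string. Every Latin-1 byte maps to the code point of the same value. */
char *latin1_to_utf8(const unsigned char *buf, size_t len) {
    char *out = malloc(2 * len + 1);   /* each byte becomes at most 2 bytes */
    if (out == NULL) return NULL;
    size_t o = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned char b = buf[i];
        if (b < 0x80) {
            out[o++] = (char)b;
        } else {
            out[o++] = (char)(0xC0 | (b >> 6));
            out[o++] = (char)(0x80 | (b & 0x3F));
        }
    }
    out[o] = '\0';
    return out;
}

/* Returns a malloc'd, NUL-terminated UTF-8 string: the input unchanged if it
 * is valid UTF-8, otherwise the input reinterpreted as ISO 8859-1. */
char *decode_utf8_or_latin1(const unsigned char *buf, size_t len) {
    if (is_valid_utf8(buf, len)) {
        char *out = malloc(len + 1);
        if (out == NULL) return NULL;
        memcpy(out, buf, len);
        out[len] = '\0';
        return out;
    }
    return latin1_to_utf8(buf, len);
}
```

The caller owns the returned buffer and should free() it. In Objective-C itself the same fallback amounts to retrying initWithData:encoding: with NSISOLatin1StringEncoding when the NSUTF8StringEncoding attempt returns nil.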
If you're downloading the data from a specific server, and especially if you have influence on what runs on that server, you should make it serve up an accurate Content-Type header and/or fix whatever bug is causing it to serve up text that isn't in UTF-8.
As Peter said, the Content-Type header is just a "hint" of what the content is expected to be. On the server side you can set any Content-Type and send any byte sequence, including an invalid one.
I had exactly the same issue with incorrect UTF-8 data that included ISO-8859-1 (Latin-1) characters (French accents).
The Wikipedia article on UTF-8 is worth reading to understand this issue and how to handle encoding errors.
The fact is that the strict implementation of NSString's initWithData:encoding: just returns nil when a decoding error occurs (unlike Java, for instance, which uses a replacement character).
Peter's solution of decoding mostly-UTF-8 data as Latin-1 did not satisfy me: every multi-byte UTF-8 character comes out wrong because of a single stray Latin-1 character.
The best option would be a fix on the server side, sure, but I'm not responsible for that side...
So I looked deeper and found a solution using the GNU libiconv C library (available on OS X and iOS). The principle is to use iconv to remove the bytes that are not valid UTF-8 (i.e. "prété" will become "prt").
Here is some sample code, equivalent to the command line iconv -c -f UTF-8 -t UTF-8 invalid.txt > cleaned.txt
#include <iconv.h>

- (NSData *)cleanUTF8:(NSData *)data {
    // Convert from UTF-8 to UTF-8 so that iconv validates every sequence.
    iconv_t cd = iconv_open("UTF-8", "UTF-8");
    if (cd == (iconv_t)-1) {
        return nil;
    }
    int one = 1;
    iconvctl(cd, ICONV_SET_DISCARD_ILSEQ, &one); // discard invalid characters

    size_t inbytesleft = data.length;
    size_t outbytesleft = data.length;
    char *inbuf = (char *)data.bytes;
    // Dropping bytes never makes the output longer than the input;
    // MAX avoids a zero-sized allocation for empty data.
    char *outbuf = malloc(MAX(data.length, 1));
    char *outptr = outbuf;
    NSData *result = nil;

    if (outbuf != NULL &&
        iconv(cd, &inbuf, &inbytesleft, &outptr, &outbytesleft) != (size_t)-1) {
        result = [NSData dataWithBytes:outbuf length:data.length - outbytesleft];
    } else {
        NSLog(@"this should not happen, seriously");
    }

    // Release the converter and the buffer on the failure path too.
    iconv_close(cd);
    free(outbuf);
    return result;
}
Then the resulting NSData can be safely decoded using NSUTF8StringEncoding.
Note that recent versions of iconv also allow fallback methods by using:
iconvctl(cd, ICONV_SET_FALLBACKS, &fallbacks);
By using a fallback on Unicode errors, you can use a replacement character or, better, try another encoding. In my case I managed to fall back to Latin-1 where UTF-8 failed, which resulted in 99% positive conversions. Look at the iconv source code to understand how it works.
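The iconv fallback API is not shown here, but the idea of falling back to Latin-1 only where UTF-8 fails can be sketched by hand in plain C. This is not the code I used; the names `utf8_seq_len` and `repair_utf8_with_latin1` are hypothetical. The sketch walks the buffer, copies every well-formed UTF-8 sequence unchanged, and treats any byte that does not start one as a single Latin-1 character, which it re-encodes as UTF-8.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: length (1 to 4) of the well-formed UTF-8 sequence
 * starting at p, or 0 if the bytes at p do not form one. */
size_t utf8_seq_len(const unsigned char *p, size_t avail) {
    if (avail == 0) return 0;
    unsigned char b = p[0];
    size_t n;
    unsigned long cp, min;
    if (b < 0x80) return 1;                                   /* ASCII */
    if ((b & 0xE0) == 0xC0)      { n = 2; cp = b & 0x1F; min = 0x80; }
    else if ((b & 0xF0) == 0xE0) { n = 3; cp = b & 0x0F; min = 0x800; }
    else if ((b & 0xF8) == 0xF0) { n = 4; cp = b & 0x07; min = 0x10000; }
    else return 0;                                            /* bad lead byte */
    if (avail < n) return 0;                                  /* truncated */
    for (size_t k = 1; k < n; k++) {
        if ((p[k] & 0xC0) != 0x80) return 0;                  /* bad continuation */
        cp = (cp << 6) | (p[k] & 0x3F);
    }
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return 0;
    return n;
}

/* Returns a malloc'd, NUL-terminated UTF-8 string in which valid UTF-8
 * sequences are kept and every other byte is read as ISO 8859-1. */
char *repair_utf8_with_latin1(const unsigned char *buf, size_t len) {
    char *out = malloc(2 * len + 1);   /* worst case: every byte doubles */
    if (out == NULL) return NULL;
    size_t o = 0, i = 0;
    while (i < len) {
        size_t n = utf8_seq_len(buf + i, len - i);
        if (n > 0) {
            /* Well-formed sequence: copy it through unchanged. */
            for (size_t k = 0; k < n; k++) out[o++] = (char)buf[i + k];
            i += n;
        } else {
            /* Stray byte (always >= 0x80 here): encode the Latin-1
             * character with that value as two UTF-8 bytes. */
            unsigned char b = buf[i++];
            out[o++] = (char)(0xC0 | (b >> 6));
            out[o++] = (char)(0x80 | (b & 0x3F));
        }
    }
    out[o] = '\0';
    return out;
}
```

Unlike the discard approach, nothing is dropped: a buffer that mixes a correctly encoded "é" with a raw Latin-1 "é" comes out with both characters intact. The trade-off is that a Latin-1 byte pair that happens to look like valid UTF-8 will be kept as the UTF-8 character, so this is a heuristic rather than a guarantee.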
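The default encoding for HTTP, if none is specified, is ISO-8859-1. (HTTP/1.1 as defined in RFC 2616 specifies this default for text media types; the later RFC 7231 dropped it.) If the HTTP response is compliant with HTTP/1.1 and does not specify a character set, that is the encoding it is using.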
Try decoding the string with NSISOLatin1StringEncoding.
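As a rough illustration of that rule, the following plain C sketch picks the charset out of a Content-Type header value and falls back to ISO-8859-1 when the header names none. The function name `charset_from_content_type` is invented for this example, and the parsing is deliberately simplistic (it just looks for the first "charset=" substring), so treat it as a sketch rather than a full header parser.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: writes the lowercased charset named in a Content-Type
 * header value into out, or "iso-8859-1" if the header is NULL or names none.
 * out always ends up NUL-terminated as long as outsz > 0. */
void charset_from_content_type(const char *header, char *out, size_t outsz) {
    static const char key[] = "charset=";
    const size_t keylen = sizeof key - 1;
    const char *val = NULL;

    if (outsz == 0) return;

    /* Case-insensitive search for "charset=" anywhere in the header value. */
    for (const char *p = header; p != NULL && *p != '\0'; p++) {
        size_t k = 0;
        while (k < keylen && p[k] != '\0' &&
               tolower((unsigned char)p[k]) == key[k]) {
            k++;
        }
        if (k == keylen) { val = p + keylen; break; }
    }

    /* No charset parameter: assume the historical HTTP default. */
    if (val == NULL || *val == '\0') val = "iso-8859-1";
    if (*val == '"') val++;            /* the value may be quoted */

    /* Copy up to the end of the parameter, lowercasing as we go. */
    size_t o = 0;
    while (*val != '\0' && *val != ';' && *val != '"' &&
           !isspace((unsigned char)*val) && o + 1 < outsz) {
        out[o++] = (char)tolower((unsigned char)*val);
        val++;
    }
    out[o] = '\0';
}
```

For example, "text/html; charset=UTF-8" yields "utf-8", while a bare "text/html" yields "iso-8859-1". The result can then be mapped to the matching NSStringEncoding before calling initWithData:encoding:.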
The data might be in another Unicode encoding, such as UTF-16, or in some totally different encoding.
There are libraries that can guess the encoding of a piece of data, but that should be a last resort. If you're using a web service, it should have documentation that says which encoding it uses. Look for it, or ask the provider of the web service. If neither is available, get some sample data, determine its encoding, and use that encoding in your program.
On a side note, is pulling web pages and data from an API good practice, i.e. buffering the data, converting into a string, and manipulating the string afterwards?
That depends on the size of the data. If it's small, that would be perfectly fine. If it's big, it would be better to deal with the data piecemeal.