json_encode() non utf-8 strings?

So I have an array of strings, and all of the strings are using the system default ANSI encoding and were pulled from a SQL database. So there are 256 different possible character byte values (single byte encoding).
Is there a way I can get json_encode() to work and display these characters instead of having to use utf8_encode() on all of my strings and ending up with stuff like \u0082?

Or is that the standard for JSON?

Josh asked Jul 07 '11 06:07


2 Answers

Is there a way I can get json_encode() to work and display these characters instead of having to use utf8_encode() on all of my strings and ending up with stuff like "\u0082"?

If you have an ANSI-encoded string, utf8_encode() is the wrong function for the job. You need to convert the string properly from ANSI to UTF-8 first. That will certainly reduce the number of Unicode escape sequences like \u0082 in the JSON output, but these sequences are technically valid JSON, so you need not fear them.

Converting ANSI to UTF-8 with PHP

json_encode works with UTF-8 encoded strings only. If you need to create valid json successfully from an ANSI encoded string, you need to re-encode/convert it to UTF-8 first. Then json_encode will just work as documented.

To convert from ANSI to UTF-8 (I assume you actually have a Windows-1252 encoded string, which is commonly but wrongly referred to as ANSI), you can use the mb_convert_encoding() function:

$str = mb_convert_encoding($str, "UTF-8", "Windows-1252"); 

Another PHP function that can convert the encoding / charset of a string is iconv(), which is based on libiconv. You can use it as well:

$str = iconv("CP1252", "UTF-8", $str); 
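To see why the conversion matters, here is a minimal sketch, assuming a current PHP version with the mbstring and iconv extensions loaded. The sample string and variable names are made up for illustration; "\x80" is the euro sign in Windows-1252.

```php
<?php
// Hypothetical sample data: "\x80" is the euro sign in Windows-1252,
// but a lone 0x80 byte is not valid UTF-8.
$ansi = "Price: \x80 5";

// On current PHP versions json_encode() returns false for malformed UTF-8.
$failed = json_encode($ansi);
$error  = json_last_error();            // JSON_ERROR_UTF8

// Convert first, then encode. Both conversions should give the same bytes.
$viaMb    = mb_convert_encoding($ansi, "UTF-8", "Windows-1252");
$viaIconv = iconv("CP1252", "UTF-8", $ansi);

$json = json_encode($viaMb);
echo $json, "\n";                       // "Price: \u20ac 5"
```

Decoding that JSON again with json_decode() gives back the UTF-8 string, so nothing is lost by the \u20ac escape.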

Note on utf8_encode()

utf8_encode() only works for Latin-1 (ISO-8859-1) input, not for ANSI. The two encodings differ in the byte range 0x80 to 0x9F, so running a Windows-1252 string through that function will destroy any characters from that range.
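The \u0082 from the question is a good illustration. The sketch below interprets the same byte once as ISO-8859-1 and once as Windows-1252. It uses mb_convert_encoding() with "ISO-8859-1" as a stand-in for utf8_encode(), on the assumption that you may be running PHP 8.2 or later, where utf8_encode() is deprecated; variable names are illustrative.

```php
<?php
$byte = "\x82";

// What utf8_encode() effectively does: treat the byte as ISO-8859-1.
// 0x82 becomes U+0082, an invisible C1 control character.
$asLatin1 = mb_convert_encoding($byte, "UTF-8", "ISO-8859-1");

// Treating the byte as Windows-1252 gives U+201A,
// the single low-9 quotation mark that was actually meant.
$asCp1252 = mb_convert_encoding($byte, "UTF-8", "Windows-1252");

echo json_encode($asLatin1), "\n"; // "\u0082"
echo json_encode($asCp1252), "\n"; // "\u201a"
```

Both outputs are valid JSON, but only the second one carries the intended character.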


Related: What is ANSI format?


For more fine-grained control over what json_encode() returns, see the list of predefined constants (PHP version dependent, incl. PHP 5.4; some constants remain undocumented and are so far available in the source code only).
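One of those constants, JSON_UNESCAPED_UNICODE (available since PHP 5.4), addresses the original wish to see the characters themselves instead of \uXXXX escapes. A small sketch, assuming the string has already been converted to UTF-8 and that the response is served as UTF-8:

```php
<?php
// Convert the Windows-1252 byte 0x82 to UTF-8 first (U+201A).
$utf8 = mb_convert_encoding("\x82", "UTF-8", "Windows-1252");

$escaped = json_encode($utf8);                         // "\u201a"
$literal = json_encode($utf8, JSON_UNESCAPED_UNICODE); // raw UTF-8 bytes inside the quotes

echo $escaped, "\n";
echo $literal, "\n";
```

Both forms decode to the same string; the flag only changes how the JSON text is written.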

Changing the encoding of an array/iteratively (PDO comment)

As you wrote in a comment that you have trouble applying the function to an array, here is a code example. The encoding always has to be changed before calling json_encode. That is just a standard array operation; for the simpler case of PDO::fetch(), a foreach iteration will do:

    $items = array();
    while ($row = $q->fetch(PDO::FETCH_ASSOC)) {
        foreach ($row as &$value) {
            // convert each column value from Windows-1252 to UTF-8
            $value = mb_convert_encoding($value, "UTF-8", "Windows-1252");
        }
        unset($value); # safety: remove reference
        $items[] = $row; # already UTF-8, so no utf8_encode() here
    }
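If the data is nested or contains non-string values such as NULL or integers, the same idea can be sketched with array_walk_recursive(). The helper name and sample rows below are hypothetical and not part of the original answer.

```php
<?php
// Hypothetical helper: converts every string in a (possibly nested) array
// from Windows-1252 to UTF-8 and leaves other value types untouched.
function cp1252_to_utf8_recursive(array $data): array
{
    array_walk_recursive($data, function (&$value) {
        if (is_string($value)) {
            $value = mb_convert_encoding($value, "UTF-8", "Windows-1252");
        }
    });
    return $data;
}

// Made-up rows standing in for a database result set.
$rows = [
    ["name" => "Ren\xE9e",  "price" => 12,  "note" => null],
    ["name" => "\x80 100",  "price" => 100, "note" => "ok"],
];

$json = json_encode(cp1252_to_utf8_recursive($rows));
echo $json, "\n";
```

The is_string() check is a design choice: it avoids passing NULL column values to mb_convert_encoding().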
hakre answered Sep 26 '22 06:09


The JSON standard ENFORCES Unicode encoding. From RFC 4627:

    3.  Encoding

       JSON text SHALL be encoded in Unicode.  The default encoding is
       UTF-8.

       Since the first two characters of a JSON text will always be ASCII
       characters [RFC0020], it is possible to determine whether an octet
       stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking
       at the pattern of nulls in the first four octets.

               00 00 00 xx  UTF-32BE
               00 xx 00 xx  UTF-16BE
               xx 00 00 00  UTF-32LE
               xx 00 xx 00  UTF-16LE
               xx xx xx xx  UTF-8

Therefore, in the strictest sense, ANSI-encoded JSON wouldn't be valid JSON; this is why PHP enforces Unicode encoding when using json_encode().
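The null-pattern rule quoted from the RFC can be sketched as a small function. The function name is hypothetical, and treating inputs shorter than four octets as UTF-8 is an assumption of this sketch, not something the RFC text states.

```php
<?php
// Hypothetical helper implementing the RFC 4627 section 3 table:
// inspect which of the first four octets are null bytes.
function detect_json_encoding(string $octets): string
{
    if (strlen($octets) < 4) {
        return "UTF-8"; // assumption: too short to apply the table
    }

    // Build a pattern such as "x0x0": "0" for a null octet, "x" otherwise.
    $pattern = "";
    for ($i = 0; $i < 4; $i++) {
        $pattern .= ($octets[$i] === "\x00") ? "0" : "x";
    }

    switch ($pattern) {
        case "000x": return "UTF-32BE";
        case "0x0x": return "UTF-16BE";
        case "x000": return "UTF-32LE";
        case "x0x0": return "UTF-16LE";
        default:     return "UTF-8";
    }
}

$text = '{"a":1}';
echo detect_json_encoding($text), "\n";
echo detect_json_encoding(mb_convert_encoding($text, "UTF-16LE", "UTF-8")), "\n";
```

Note that a Windows-1252 byte stream would fall into the last row and be taken for UTF-8, which is exactly why it has to be converted before it is emitted as JSON.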

As for "default ANSI", I'm pretty sure that your strings are encoded in Windows-1252. It is incorrectly referred to as ANSI.

Andrew Moore answered Sep 25 '22 06:09