How to keep json_encode() from dropping strings with invalid characters

Is there a way to keep json_encode() from returning null for a string that contains an invalid (non-UTF-8) character?

It can be a pain in the ass to debug in a complex system. It would be much more fitting to actually see the invalid character, or at least have it omitted. As it stands, json_encode() will silently drop the entire string.

Example (in UTF-8):

    $string = array(
        utf8_decode("Düsseldorf"), // Deliberately produce broken string
        "Washington",
        "Nairobi"
    );

    print_r(json_encode($string));

Results in

[null,"Washington","Nairobi"]

Desired result:

["D�sseldorf","Washington","Nairobi"]

Note: I am not looking to make broken strings work in json_encode(). I am looking for ways to make it easier to diagnose encoding errors. A null string isn't helpful for that.

508

asked Jan 11 '11 23:01

Pekka

2 Answers

PHP does try to raise an error, but only if you turn display_errors off. This is odd, because the display_errors setting is only meant to control whether errors are printed to standard output, not whether an error is triggered. I want to emphasize that when display_errors is on, even though you may see all kinds of other PHP errors, PHP doesn't just hide this one: it never triggers it at all. That means it will not show up in any error log, and no custom error handler will be called. The error simply never occurs.

Here's some code that demonstrates this:

    error_reporting(-1); // report all errors
    $invalid_utf8_char = chr(193);

    ini_set('display_errors', 1); // display errors to standard output
    var_dump(json_encode($invalid_utf8_char));
    var_dump(error_get_last()); // nothing

    ini_set('display_errors', 0); // do not display errors to standard output
    var_dump(json_encode($invalid_utf8_char));
    var_dump(error_get_last()); // json_encode(): Invalid UTF-8 sequence in argument

That bizarre and unfortunate behavior is related to this bug https://bugs.php.net/bug.php?id=47494 and a few others, and doesn't look like it will ever be fixed.
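Independent of the display_errors behavior described above, a failed call can also be detected by checking json_last_error() right after encoding. The following is a minimal sketch, not part of the original answer: encodeOrExplain() is a hypothetical helper name, and json_last_error_msg() assumes PHP 5.5 or later. On those versions json_encode() is expected to return false for the whole value rather than a null element.

```php
<?php
// Hypothetical helper (an illustration, not from the original answer):
// encode a value and report why encoding failed instead of silently
// handing back null/false.
function encodeOrExplain($value)
{
    $json = json_encode($value);
    if (json_last_error() !== JSON_ERROR_NONE) {
        // json_last_error_msg() is available from PHP 5.5 onwards.
        return 'json_encode failed: ' . json_last_error_msg();
    }
    return $json;
}

// "D\xFCsseldorf" holds the ISO-8859-1 byte for "ü", which is not valid UTF-8.
echo encodeOrExplain(array("D\xFCsseldorf", "Washington")), "\n";
echo encodeOrExplain(array("Washington", "Nairobi")), "\n";
```

This does not show which character is broken, but it at least turns a silent null into an explicit message that can be logged.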

Workaround:

Cleaning the string before passing it to json_encode may be a workable solution.

    $stripped_of_invalid_utf8_chars_string = iconv('UTF-8', 'UTF-8//IGNORE', $orig_string);
    if ($stripped_of_invalid_utf8_chars_string !== $orig_string) {
        // one or more chars were invalid, and so they were stripped out.
        // if you need to know where in the string the first stripped character was,
        // then see http://stackoverflow.com/questions/7475437/find-first-character-that-is-different-between-two-strings
    }
    $json = json_encode($stripped_of_invalid_utf8_chars_string);

http://php.net/manual/en/function.iconv.php

The manual says

//IGNORE silently discards characters that are illegal in the target charset.

So by first removing the problematic characters, in theory json_encode() shouldn't get anything it will choke on. I haven't verified that the output of iconv with the //IGNORE flag is perfectly compatible with json_encode()'s notion of what valid UTF-8 characters are, so buyer beware, as there may be edge cases where it still fails. Ugh, I hate character set issues.

Edit
In PHP 7.2+, there are two new flags for json_encode(): JSON_INVALID_UTF8_IGNORE and JSON_INVALID_UTF8_SUBSTITUTE.
There's not much documentation yet, but for now, this test should help you understand the expected behavior: https://github.com/php/php-src/blob/master/ext/json/tests/json_encode_invalid_utf8.phpt

And in PHP 7.3+ there's the new flag JSON_THROW_ON_ERROR. See http://php.net/manual/en/class.jsonexception.php
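As a rough illustration of how these three flags behave, here is a small sketch. It assumes PHP 7.3 or later so that all three constants exist; the variable names are made up for the example.

```php
<?php
// "D\xFCsseldorf" holds the ISO-8859-1 byte for "ü", which is not valid UTF-8.
$cities = array("D\xFCsseldorf", "Washington", "Nairobi");

// PHP 7.2+: replace each invalid byte sequence with U+FFFD.
$substituted = json_encode($cities, JSON_INVALID_UTF8_SUBSTITUTE);

// PHP 7.2+: drop invalid byte sequences from the output.
$ignored = json_encode($cities, JSON_INVALID_UTF8_IGNORE);

// PHP 7.3+: throw a JsonException instead of returning false.
try {
    json_encode($cities, JSON_THROW_ON_ERROR);
    $message = null;
} catch (JsonException $e) {
    $message = $e->getMessage();
}

var_dump($substituted, $ignored, $message);
```

JSON_INVALID_UTF8_SUBSTITUTE is the closest match to the result the question asks for: the broken character stays visible as U+FFFD (written as the escape \ufffd unless JSON_UNESCAPED_UNICODE is also set), and the rest of the string survives.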

108

answered Sep 27 '22 23:09

goat

This function will remove all invalid UTF8 chars from a string:

    function removeInvalidChars($text) {
        $regex = '/( [\x00-\x7F] | [\xC0-\xDF][\x80-\xBF] | [\xE0-\xEF][\x80-\xBF]{2} | [\xF0-\xF7][\x80-\xBF]{3} ) | ./x';
        return preg_replace($regex, '$1', $text);
    }

I use it after converting an Excel document to json, as Excel docs aren't guaranteed to be in UTF8.

I don't think there's a particularly sensible way of converting invalid chars to a visible but valid character. You could replace invalid chars with U+FFFD which is the unicode replacement character by turning the regex above around, but that really doesn't provide a better user experience than just dropping invalid chars.
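For anyone who does want the U+FFFD variant, "turning the regex around" could look roughly like the sketch below. replaceInvalidChars() is a hypothetical name, and the pattern is copied unchanged from removeInvalidChars(), so it inherits that function's fairly permissive notion of what counts as valid UTF-8.

```php
<?php
// Hypothetical counterpart to removeInvalidChars(): keeps byte sequences
// that match the pattern and replaces every other byte with U+FFFD.
function replaceInvalidChars($text)
{
    $regex = '/( [\x00-\x7F] | [\xC0-\xDF][\x80-\xBF] | [\xE0-\xEF][\x80-\xBF]{2} | [\xF0-\xF7][\x80-\xBF]{3} ) | ./x';
    return preg_replace_callback($regex, function ($m) {
        // Group 1 is only present when a well-formed sequence matched;
        // otherwise the lone "." alternative consumed a single stray byte.
        return (isset($m[1]) && $m[1] !== '') ? $m[1] : "\xEF\xBF\xBD";
    }, $text);
}

// "D\xFCsseldorf" holds the ISO-8859-1 byte for "ü", which is not valid UTF-8.
echo json_encode(array(replaceInvalidChars("D\xFCsseldorf"), "Washington")), "\n";
```

The bytes "\xEF\xBF\xBD" are the UTF-8 encoding of U+FFFD, so the cleaned string can be passed to json_encode() without being rejected.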

answered Sep 27 '22 21:09

Danack

Tags:

json

php

utf-8
