I'm trying to parse an XML file using PHP, but I get an error message:
parser error : Char 0x0 out of allowed range in
I think it's caused by the content of the XML: I suspect a special symbol such as "☆". Any ideas what I can do to fix it?
I also get:
parser error : Premature end of data in tag item line
What might be causing that error?
I'm using simplexml_load_file.
I tried to find the line the error points to and pasted its content into a separate XML file, and that file parses fine! So I still cannot figure out what makes the XML file fail to parse. PS: it's a huge XML file, over 100 MB. Could the size cause the parse error?
Do you have control over the XML? If so, ensure the data is enclosed in <![CDATA[ .. ]]> blocks.
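A minimal sketch of the CDATA suggestion, assuming you generate the XML yourself. The helper name `wrap_cdata` and the sample element names are hypothetical. A CDATA section cannot contain the sequence `]]>`, so the sketch splits it across two sections. Note that CDATA only removes the need to escape markup such as `<` and `&`; a NUL byte (0x0) is still not an allowed XML character inside CDATA, so invalid characters still have to be removed separately.

```php
<?php
// Hypothetical helper: wrap arbitrary text in a CDATA section.
function wrap_cdata(string $text): string
{
    // "]]>" would end the section early, so split it across two sections.
    $safe = str_replace(']]>', ']]]]><![CDATA[>', $text);
    return '<![CDATA[' . $safe . ']]>';
}

// Illustrative data containing markup characters and the star symbol.
$xml = '<item><title>' . wrap_cdata('5 < 6 & "☆" ]]> done') . '</title></item>';

$item = simplexml_load_string($xml);
echo (string) $item->title, PHP_EOL;
```

The star symbol itself is a legal XML character, so it parses without CDATA as long as the file's encoding is declared correctly.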
And you also need to clear the invalid characters:
    /**
     * Removes invalid XML
     *
     * @access public
     * @param string $value
     * @return string
     */
    function stripInvalidXml($value)
    {
        $ret = "";
        if (empty($value)) {
            return $ret;
        }

        $length = strlen($value);
        for ($i = 0; $i < $length; $i++) {
            // ord() returns one byte (0-255), so this works byte by byte;
            // bytes of multi-byte UTF-8 characters (0x80 and up) are kept as-is.
            $current = ord($value[$i]);
            if (($current == 0x9) ||
                ($current == 0xA) ||
                ($current == 0xD) ||
                (($current >= 0x20) && ($current <= 0xD7FF)) ||
                (($current >= 0xE000) && ($current <= 0xFFFD)) ||
                (($current >= 0x10000) && ($current <= 0x10FFFF))) {
                $ret .= chr($current);
            } else {
                // Replace a disallowed control byte with a space.
                $ret .= " ";
            }
        }
        return $ret;
    }
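As a rough illustration of how a sanitizer like the one above might be applied, the sketch below cleans a string before parsing and collects libxml's errors instead of letting them surface as warnings. The function `replace_control_bytes` is a hypothetical, regex-based stand-in for the byte loop above, and the sample XML is made up; for a real file you would read the contents first and then use `simplexml_load_string` rather than `simplexml_load_file`.

```php
<?php
// Hypothetical stand-in for the sanitizer above: replace the control bytes
// that XML 1.0 forbids (everything below 0x20 except tab, LF, CR) with a space.
function replace_control_bytes(string $xml): string
{
    return preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', ' ', $xml);
}

// Illustrative input containing a NUL byte after the star symbol.
$raw = "<items><item>star \u{2606}\x00</item></items>";

// Collect parser errors instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$before = simplexml_load_string($raw);
$errors = libxml_get_errors();
libxml_clear_errors();

$after = simplexml_load_string(replace_control_bytes($raw));

foreach ($errors as $error) {
    echo 'line ', $error->line, ': ', trim($error->message), PHP_EOL;
}
var_dump($before === false); // whether the raw input failed to parse
var_dump($after !== false);  // whether the cleaned input parsed
```

Printing the collected errors with their line numbers can also help locate the offending bytes in a large file.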
I decided to test all Unicode code points (0-1114111) to make sure things work as they should. When tested against that full range, preg_replace() returned NULL because of errors. This is the solution I came up with.
    $utf_8_range = range(0, 1114111);
    $output = ords_to_utfstring($utf_8_range);
    $sanitized = sanitize_for_xml($output);

    /**
     * Removes invalid XML
     *
     * @access public
     * @param string $input
     * @return string
     */
    function sanitize_for_xml($input) {
      // Convert input to UTF-8.
      $old_setting = ini_set('mbstring.substitute_character', '"none"');
      $input = mb_convert_encoding($input, 'UTF-8', 'auto');
      ini_set('mbstring.substitute_character', $old_setting);

      // Use fast preg_replace. If failure, use slower chr => int => chr conversion.
      // The last range keeps supplementary-plane characters, which XML 1.0 allows
      // and which the slower fallback below also keeps.
      $output = preg_replace('/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+/u', '', $input);
      if (is_null($output)) {
        // Convert to ints.
        // Convert ints back into a string.
        $output = ords_to_utfstring(utfstring_to_ords($input), TRUE);
      }
      return $output;
    }

    /**
     * Given a UTF-8 string, output an array of ordinal values.
     *
     * @param string $input
     *   UTF-8 string.
     * @param string $encoding
     *   Defaults to UTF-8.
     *
     * @return array
     *   Array of ordinal values representing the input string.
     */
    function utfstring_to_ords($input, $encoding = 'UTF-8') {
      // Turn a string of unicode characters into UCS-4BE, which is a Unicode
      // encoding that stores each character as a 4 byte integer. This accounts for
      // the "UCS-4"; the "BE" suffix indicates that the integers are stored in
      // big-endian order. The reason for this encoding is that each character is a
      // fixed size, making iterating over the string simpler.
      $input = mb_convert_encoding($input, "UCS-4BE", $encoding);

      // Visit each unicode character.
      $ords = array();
      for ($i = 0; $i < mb_strlen($input, "UCS-4BE"); $i++) {
        // Now we have 4 bytes. Find their total numeric value.
        $s2 = mb_substr($input, $i, 1, "UCS-4BE");
        $val = unpack("N", $s2);
        $ords[] = $val[1];
      }
      return $ords;
    }

    /**
     * Given an array of ints representing Unicode chars, outputs a UTF-8 string.
     *
     * @param array $ords
     *   Array of integers representing Unicode characters.
     * @param bool $scrub_XML
     *   Set to TRUE to remove non valid XML characters.
     *
     * @return string
     *   UTF-8 String.
     */
    function ords_to_utfstring($ords, $scrub_XML = FALSE) {
      $output = '';
      foreach ($ords as $ord) {
        // 0: Negative numbers.
        // 55296 - 57343: Surrogate Range.
        // 65279: BOM (byte order mark).
        // 1114111: Out of range.
        if ($ord < 0
            || ($ord >= 0xD800 && $ord <= 0xDFFF)
            || $ord == 0xFEFF
            || $ord > 0x10ffff) {
          // Skip non valid UTF-8 values.
          continue;
        }
        // 9: Anything Below 9.
        // 11: Vertical Tab.
        // 12: Form Feed.
        // 14-31: Unprintable control codes.
        // 65534, 65535: Unicode noncharacters.
        elseif ($scrub_XML && (
                $ord < 0x9
                || $ord == 0xB
                || $ord == 0xC
                || ($ord > 0xD && $ord < 0x20)
                || $ord == 0xFFFE
                || $ord == 0xFFFF
                )) {
          // Skip non valid XML values.
          continue;
        }
        // 127: 1 Byte char.
        elseif ($ord <= 0x007f) {
          $output .= chr($ord);
          continue;
        }
        // 2047: 2 Byte char.
        elseif ($ord <= 0x07ff) {
          $output .= chr(0xc0 | ($ord >> 6));
          $output .= chr(0x80 | ($ord & 0x003f));
          continue;
        }
        // 65535: 3 Byte char.
        elseif ($ord <= 0xffff) {
          $output .= chr(0xe0 | ($ord >> 12));
          $output .= chr(0x80 | (($ord >> 6) & 0x003f));
          $output .= chr(0x80 | ($ord & 0x003f));
          continue;
        }
        // 1114111: 4 Byte char.
        elseif ($ord <= 0x10ffff) {
          $output .= chr(0xf0 | ($ord >> 18));
          $output .= chr(0x80 | (($ord >> 12) & 0x3f));
          $output .= chr(0x80 | (($ord >> 6) & 0x3f));
          $output .= chr(0x80 | ($ord & 0x3f));
          continue;
        }
      }
      return $output;
    }
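The NULL fallback in sanitize_for_xml() matters because preg_replace() signals failure by returning NULL rather than the original string. The answer does not say which error occurred in its test; one well-known trigger, shown in this small sketch, is a subject that is not valid UTF-8 when the pattern uses the `/u` modifier. The sample strings and the shortened pattern are illustrative only.

```php
<?php
// A lone 0xFF byte is never valid in UTF-8.
$invalid = "abc\xFFdef";

// Shortened, illustrative pattern using the /u (UTF-8) modifier.
$result = preg_replace('/[^\x{0020}-\x{D7FF}]+/u', '', $invalid);

var_dump(is_null($result));                          // NULL signals a PCRE error
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR); // the reason for the NULL

// A valid UTF-8 subject goes through the same pattern without error;
// the 0x01 control character is removed.
$valid = preg_replace('/[^\x{0020}-\x{D7FF}]+/u', '', "abc\x01def");
var_dump($valid);
```

Checking `preg_last_error()` right after the call is a cheap way to find out why the fast path failed before falling back to the slower conversion.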
And to do this on a simple object or array:
    // Recursive sanitize_for_xml.
    function recursive_sanitize_for_xml(&$input) {
      if (is_null($input) || is_bool($input) || is_numeric($input)) {
        return;
      }
      if (!is_array($input) && !is_object($input)) {
        $input = sanitize_for_xml($input);
      }
      else {
        foreach ($input as &$value) {
          recursive_sanitize_for_xml($value);
        }
      }
    }