How to skip invalid characters in XML file using PHP

Tags: php, xml, utf-8

I'm trying to parse an XML file using PHP, but I get an error message:

parser error : Char 0x0 out of allowed range in

I think it's because of the content of the XML: I suspect a special symbol such as "☆". Any ideas what I can do to fix it?

I also get:

parser error : Premature end of data in tag item line

What might be causing that error?

I'm using simplexml_load_file.

Update:

I tried to find the line reported in the error and pasted its content into a separate XML file, and that parses fine! So I still cannot figure out what makes the XML file fail to parse. PS: it's a huge XML file, over 100 MB. Could the size cause the parse error?

user315396 asked Aug 12 '10 08:08




2 Answers

Do you have control over the XML? If so, ensure the data is enclosed in <![CDATA[ .. ]]> blocks.
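As a rough illustration of the CDATA suggestion, here is a minimal sketch that assumes you generate the XML yourself with PHP's DOM extension. The element name and sample text are made up for the example.

```php
<?php
// Sketch: wrap free-form text in a CDATA section when generating XML.
// The <item> element and the sample text are hypothetical.
$doc = new DOMDocument('1.0', 'UTF-8');
$item = $doc->createElement('item');
$doc->appendChild($item);

// Markup characters such as < and & need no manual escaping inside CDATA.
$text = 'Price < 10 & rating ☆';
$item->appendChild($doc->createCDATASection($text));

$xml = $doc->saveXML();

// Round trip: read the text back with SimpleXML.
$parsed = simplexml_load_string($xml);
$roundTrip = (string) $parsed;
```

Note that CDATA only removes the need to escape markup; characters outside the XML character range, such as 0x0, remain illegal inside it.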

And you also need to clear the invalid characters:

/**
 * Removes invalid XML
 *
 * @access public
 * @param string $value
 * @return string
 */
function stripInvalidXml($value)
{
    $ret = "";
    if ($value === null || $value === "")
    {
        return $ret;
    }

    // strlen() and ord() work on bytes, so $current is always 0-255 here.
    // In practice this replaces the forbidden control bytes with a space.
    $length = strlen($value);
    for ($i=0; $i < $length; $i++)
    {
        $current = ord($value[$i]);
        if (($current == 0x9) ||
            ($current == 0xA) ||
            ($current == 0xD) ||
            (($current >= 0x20) && ($current <= 0xD7FF)) ||
            (($current >= 0xE000) && ($current <= 0xFFFD)) ||
            (($current >= 0x10000) && ($current <= 0x10FFFF)))
        {
            $ret .= chr($current);
        }
        else
        {
            $ret .= " ";
        }
    }
    return $ret;
}
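To show how such a cleanup might be wired into the parsing step from the question, here is a small sketch. It assumes the file fits in memory; `replaceControlBytes` is a hypothetical byte-level helper, and the inline string stands in for the real file contents.

```php
<?php
// Hypothetical helper: replaces the control bytes that XML 1.0 forbids
// (everything below 0x20 except tab, LF and CR) with a space.
function replaceControlBytes(string $xml): string
{
    return preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', ' ', $xml);
}

// Stand-in for file_get_contents() on the real file; contains a NUL byte.
$raw = "<items><item>star \xE2\x98\x86 and\x00NUL</item></items>";

// Collect parser errors instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// Parsing the raw data fails; each LibXMLError carries line, column, message.
$before = simplexml_load_string($raw);
$errors = libxml_get_errors();
libxml_clear_errors();

// After cleaning, the same data parses.
$after = simplexml_load_string(replaceControlBytes($raw));
$text = ($after !== false) ? (string) $after->item : null;
```

The ☆ (U+2606) is itself a legal XML character, so a stray 0x0 byte is the likelier culprit. The "Premature end of data" message may simply be a follow-on error reported once the parser gives up at that byte.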
Jhong answered Oct 12 '22 10:10


I decided to test every Unicode code point (0 to 1114111), encoded as UTF-8, to make sure things work as they should. Using preg_replace() on its own returns NULL because of errors when it is run against all of these values. This is the solution I've come up with.
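The NULL return can be reproduced in isolation. The sketch below is only an illustration with a made-up input string: it feeds a malformed UTF-8 sequence to a pattern using the /u modifier and inspects preg_last_error().

```php
<?php
// "\xC3\x28" is malformed UTF-8: a lead byte followed by a byte that is
// not a continuation byte. The sample string is hypothetical.
$invalid = "abc\xC3\x28def";

// With the /u modifier, PCRE refuses subjects that are not valid UTF-8
// and preg_replace() returns NULL instead of a string.
$result = preg_replace('/[^\x{20}-\x{D7FF}]+/u', '', $invalid);
$errorCode = preg_last_error();

$wasRejected = ($result === null && $errorCode === PREG_BAD_UTF8_ERROR);

// mb_check_encoding() lets you detect this case before calling preg_replace().
$isValidUtf8 = mb_check_encoding($invalid, 'UTF-8');
```

Checking for NULL, as the code in this answer does, is what allows a slower fallback to take over for such input.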

$utf_8_range = range(0, 1114111);
$output = ords_to_utfstring($utf_8_range);
$sanitized = sanitize_for_xml($output);

/**
 * Removes invalid XML
 *
 * @access public
 * @param string $input
 * @return string
 */
function sanitize_for_xml($input) {
  // Convert input to UTF-8.
  $old_setting = ini_set('mbstring.substitute_character', '"none"');
  $input = mb_convert_encoding($input, 'UTF-8', 'auto');
  ini_set('mbstring.substitute_character', $old_setting);

  // Use fast preg_replace. If failure, use slower chr => int => chr conversion.
  $output = preg_replace('/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+/u', '', $input);
  if (is_null($output)) {
    // Convert to ints.
    // Convert ints back into a string.
    $output = ords_to_utfstring(utfstring_to_ords($input), TRUE);
  }
  return $output;
}

/**
 * Given a UTF-8 string, output an array of ordinal values.
 *
 * @param string $input
 *   UTF-8 string.
 * @param string $encoding
 *   Defaults to UTF-8.
 *
 * @return array
 *   Array of ordinal values representing the input string.
 */
function utfstring_to_ords($input, $encoding = 'UTF-8'){
  // Turn a string of unicode characters into UCS-4BE, which is a Unicode
  // encoding that stores each character as a 4 byte integer. This accounts for
  // the "UCS-4"; the "BE" suffix indicates that the integers are stored in
  // big-endian order. The reason for this encoding is that each character is a
  // fixed size, making iterating over the string simpler.
  $input = mb_convert_encoding($input, "UCS-4BE", $encoding);

  // Visit each unicode character.
  $ords = array();
  for ($i = 0; $i < mb_strlen($input, "UCS-4BE"); $i++) {
    // Now we have 4 bytes. Find their total numeric value.
    $s2 = mb_substr($input, $i, 1, "UCS-4BE");
    $val = unpack("N", $s2);
    $ords[] = $val[1];
  }
  return $ords;
}

/**
 * Given an array of ints representing Unicode chars, outputs a UTF-8 string.
 *
 * @param array $ords
 *   Array of integers representing Unicode characters.
 * @param bool $scrub_XML
 *   Set to TRUE to remove non valid XML characters.
 *
 * @return string
 *   UTF-8 String.
 */
function ords_to_utfstring($ords, $scrub_XML = FALSE) {
  $output = '';
  foreach ($ords as $ord) {
    // 0: Negative numbers.
    // 55296 - 57343: Surrogate Range.
    // 65279: BOM (byte order mark).
    // 1114111: Out of range.
    if (   $ord < 0
        || ($ord >= 0xD800 && $ord <= 0xDFFF)
        || $ord == 0xFEFF
        || $ord > 0x10ffff) {
      // Skip non valid UTF-8 values.
      continue;
    }
    // 9: Anything Below 9.
    // 11: Vertical Tab.
    // 12: Form Feed.
    // 14-31: Unprintable control codes.
    // 65534, 65535: Unicode noncharacters.
    elseif ($scrub_XML && (
               $ord < 0x9
            || $ord == 0xB
            || $ord == 0xC
            || ($ord > 0xD && $ord < 0x20)
            || $ord == 0xFFFE
            || $ord == 0xFFFF
            )) {
      // Skip non valid XML values.
      continue;
    }
    // 127: 1 Byte char.
    elseif ( $ord <= 0x007f) {
      $output .= chr($ord);
      continue;
    }
    // 2047: 2 Byte char.
    elseif ($ord <= 0x07ff) {
      $output .= chr(0xc0 | ($ord >> 6));
      $output .= chr(0x80 | ($ord & 0x003f));
      continue;
    }
    // 65535: 3 Byte char.
    elseif ($ord <= 0xffff) {
      $output .= chr(0xe0 | ($ord >> 12));
      $output .= chr(0x80 | (($ord >> 6) & 0x003f));
      $output .= chr(0x80 | ($ord & 0x003f));
      continue;
    }
    // 1114111: 4 Byte char.
    elseif ($ord <= 0x10ffff) {
      $output .= chr(0xf0 | ($ord >> 18));
      $output .= chr(0x80 | (($ord >> 12) & 0x3f));
      $output .= chr(0x80 | (($ord >> 6) & 0x3f));
      $output .= chr(0x80 | ($ord & 0x3f));
      continue;
    }
  }
  return $output;
}

And to do this on a simple object or array:

// Recursive sanitize_for_xml.
function recursive_sanitize_for_xml(&$input){
  if (is_null($input) || is_bool($input) || is_numeric($input)) {
    return;
  }
  if (!is_array($input) && !is_object($input)) {
    $input = sanitize_for_xml($input);
  }
  else {
    foreach ($input as &$value) {
      recursive_sanitize_for_xml($value);
    }
  }
}
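A short usage sketch may help here. To keep it self-contained, it repeats the recursive function and swaps in a simplified stand-in for sanitize_for_xml() that assumes the input is already valid UTF-8; the sample record is made up.

```php
<?php
// Simplified stand-in for sanitize_for_xml() (assumption: valid UTF-8 input).
// It drops every character outside the XML 1.0 character range.
function sanitize_for_xml($input) {
  return preg_replace(
    '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+/u',
    '',
    $input
  );
}

// Same recursive walker as above: modifies arrays and objects in place.
function recursive_sanitize_for_xml(&$input) {
  if (is_null($input) || is_bool($input) || is_numeric($input)) {
    return;
  }
  if (!is_array($input) && !is_object($input)) {
    $input = sanitize_for_xml($input);
  }
  else {
    foreach ($input as &$value) {
      recursive_sanitize_for_xml($value);
    }
  }
}

// Hypothetical nested record containing a NUL byte and a form feed.
$record = array(
  'id'    => 42,
  'title' => "Bad\x00title",
  'tags'  => array("ok", "form\x0Cfeed"),
);

// The array is passed by reference, so nothing is returned.
recursive_sanitize_for_xml($record);
```

Numeric, boolean and NULL values are left untouched, so only the two strings containing control characters change.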
mikeytown2 answered Oct 12 '22 12:10