
How do I best remove the Unicode characters that XHTML regards as invalid, using PHP?

Tags: php, xhtml, unicode

I run a forum that supports an international mathematics group. I recently switched it to Unicode for better support of international characters. While debugging this conversion, I discovered that not all Unicode characters are valid in XHTML (the relevant page appears to be http://www.w3.org/TR/unicode-xml/). Before the forum software sends posts to the browser, it runs an XHTML validation/sanitisation step. It seems reasonable that this step should also remove any Unicode characters that XHTML doesn't allow.

So my question is:

Is there a standard (or best) way of doing this in PHP?

(The forum is written in PHP, by the way.)

I guess that the fail-safe would be a simple str_replace. (If that is also the best option, do I need to do anything extra to make sure it works properly with Unicode?) But that would mean going carefully through the XHTML DTD (or the W3 page referenced above) to work out which characters to list in the search argument of str_replace. So if this is the best way, has someone already done that so that I can steal, err, copy, it?

(Incidentally, the character that caused the problem was U+000C, the form feed, which (according to the W3 page) is valid in HTML but invalid in XHTML!)

asked Apr 13 '10 07:04 by Andrew Stacey


2 Answers

I found a function on phpedit.net that might do what you want.

I'll post the function here for the archive; credit goes to ltp on PHPEdit.net:

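/**
 * Removes invalid XML
 *
 * Works byte by byte, so on UTF-8 input only the control-character checks
 * ever reject anything: ord() never returns more than 0xFF, and every byte
 * of a multibyte sequence (0x80-0xFF) passes the 0x20-0xD7FF test.
 *
 * @access public
 * @param string $value
 * @return string
 */
function stripInvalidXml($value)
{
    $ret = "";
    // Strict comparison instead of empty(), which would also discard "0".
    if ($value === null || $value === "")
    {
        return $ret;
    }

    $length = strlen($value);
    for ($i = 0; $i < $length; $i++)
    {
        // $value[$i] replaces the $value{$i} syntax, which PHP 8 removed.
        $current = ord($value[$i]);
        if (($current == 0x9) ||
            ($current == 0xA) ||
            ($current == 0xD) ||
            (($current >= 0x20) && ($current <= 0xD7FF)) ||
            (($current >= 0xE000) && ($current <= 0xFFFD)) ||
            (($current >= 0x10000) && ($current <= 0x10FFFF)))
        {
            $ret .= chr($current);
        }
        else
        {
            // Invalid bytes are replaced with a space rather than dropped.
            $ret .= " ";
        }
    }
    return $ret;
}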
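
Because the loop above inspects one byte at a time, a variant that works on whole code points may be closer to what the question asks for. The following is only a sketch, assuming the input is valid UTF-8 and PHP 7 or later. The name stripInvalidXmlUtf8 is hypothetical, and the character class simply restates the ranges tested in the function above (the XML 1.0 "Char" ranges).

```php
<?php
/**
 * Replaces every code point outside the XML 1.0 "Char" ranges with a space.
 * Hypothetical helper; assumes $value is valid UTF-8.
 */
function stripInvalidXmlUtf8(string $value): string
{
    // The /u modifier makes PCRE treat pattern and subject as UTF-8,
    // so each \x{...} escape matches a whole code point, not a byte.
    $pattern = '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u';
    $clean = preg_replace($pattern, ' ', $value);

    if ($clean === null) {
        // preg_replace() returns null on error, e.g. malformed UTF-8 input.
        throw new InvalidArgumentException('Input is not valid UTF-8.');
    }
    return $clean;
}

// The form feed (U+000C) is replaced; the accented letter is left intact.
echo stripInvalidXmlUtf8("caf\u{00E9}\x0Cbar"), "\n";
```

Throwing on malformed input is a design choice made for this sketch; a forum might prefer to repair or reject the post earlier in its pipeline.
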
answered Oct 24 '22 10:10 by Bas


Assuming your input is UTF-8, you can remove Unicode ranges with something like

 preg_replace('~[\x{17A3}-\x{17D3}]~u', '', $input);
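
Applied to the form feed from the question, the same range-removal pattern might look like the sketch below. It assumes UTF-8 input; the sample string is made up for illustration, and the ranges listed are the control characters below U+0020 other than tab, line feed and carriage return.

```php
<?php
// Hypothetical sample post containing a form feed (U+000C) between two parts.
$input = "page one\x0Cpage two";

// Control characters below U+0020, excluding tab (U+0009),
// line feed (U+000A) and carriage return (U+000D).
$clean = preg_replace('~[\x{00}-\x{08}\x{0B}\x{0C}\x{0E}-\x{1F}]~u', '', $input);

var_dump($clean);
```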

Another, and better, approach is to remove everything by default and whitelist only the characters you want to keep. Unicode properties (\p) are quite practical for this. For example, this removes everything except (Unicode) letters and numbers:

  preg_replace('~[^\p{L}\p{N}]~u', '', $input);
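
Letters and numbers alone would strip spaces, punctuation and operators, so for forum posts the whitelist probably needs to be wider. Here is a rough sketch of that idea, assuming UTF-8 input and PHP 7 or later. The helper name keepReadableText and the exact set of properties are assumptions chosen for illustration, not a recommendation for any particular forum.

```php
<?php
/**
 * Hypothetical helper: keeps letters, combining marks, numbers, punctuation,
 * symbols, space separators, tabs and newlines, and drops everything else.
 */
function keepReadableText(string $input): string
{
    $clean = preg_replace('~[^\p{L}\p{M}\p{N}\p{P}\p{S}\p{Zs}\t\n]~u', '', $input);

    // preg_replace() returns null on error (e.g. malformed UTF-8);
    // this sketch falls back to an empty string in that case.
    return $clean === null ? '' : $clean;
}

// A superscript two (U+00B2) and the operators survive; the form feed does not.
echo keepReadableText("x\u{00B2} + y = 3\x0C"), "\n";
```
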
answered Oct 24 '22 10:10 by user187291