
How to detect invalid html entities in PHP?

I have a bunch of text/html documents I'm processing.

Some of them contain encoded HTML entities, which I'm trying to convert into their raw decoded UTF-8 characters.

This is easy using html_entity_decode. However, some of the entities are invalid, such as

&#x99999;

For this reason I'm using a regexp to pull out every individual entity, and then trying to validate them somehow.

If an entity is invalid, I want to leave it as &#x99999; in the document, but things like an encoded &amp; would still become &.

Just some sample test code I knocked up..

<?php
function dump_chars($s)
{
    if (preg_match_all('/&[#A-Za-z0-9]+;/', $s, $matches))
    {
        foreach ($matches[0] as $m)
        {
            $decoded = html_entity_decode($m, ENT_QUOTES, "UTF-8");

            echo "[" . htmlentities($m, ENT_QUOTES, "UTF-8") . "] ";
            echo "Decoded: [" . $decoded . "] ";
            echo "Hex: [" . bin2hex($decoded) . "] "; 
            echo "detect: [" . mb_detect_encoding($decoded) . "]";
            echo "<br>";
        }
    }
}

$payload = "&quot; &amp; &#x349; &#x92; &#x99999;";
echo "<html><head><meta charset='UTF-8'></head><body>";
dump_chars($payload);

I'm drawing a bit of a blank on how best to validate each entity, and would love some help please.

asked Jul 05 '14 23:07 by carpii


1 Answer

I eventually found a way..

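function decode_numeric_entities($s)
{
    $result = $s;
    // Conversion map: start code point, end code point, offset, mask.
    // Only numeric entities in the range 0x0..0x2FFFF get decoded.
    $convmap = array(0x0, 0x2FFFF, 0, 0xFFFF);

    // Pull out every entity-like token, then try to decode each one.
    if (preg_match_all('/&[#A-Za-z0-9]+;/', $s, $matches))
    {
        foreach ($matches[0] as $m)
        {
            // Tokens that mb_decode_numericentity cannot decode come back unchanged.
            $decoded = mb_decode_numericentity($m, $convmap, 'UTF-8');
            $result = str_replace($m, $decoded, $result);
        }
    }
    return $result;
}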

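Running a string through this function converts every numeric entity whose code point falls inside the $convmap range (0x0 to 0x2FFFF here) to its actual UTF-8 character, leaving the out-of-range ones as entities. Named entities such as &amp; are not handled by mb_decode_numericentity, so they are also left unchanged.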
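
Since the question also wanted named entities such as &amp; decoded, here is a rough sketch of one way to cover both cases in a single pass with preg_replace_callback. The names decode_valid_entities and is_acceptable_code_point are hypothetical, and the validity rule (U+0001 up to a configurable maximum, surrogates excluded) is only an assumption modelled on the 0x2FFFF limit used above. Adjust it to whatever "invalid" means for your documents. It needs the mbstring extension and PHP 7.2+ for mb_chr.

```php
<?php
// Sketch only: the function names and the validity rule are assumptions.

// Hypothetical rule: accept U+0001..$maxCodePoint, except surrogates.
function is_acceptable_code_point(int $cp, int $maxCodePoint): bool
{
    if ($cp < 0x1 || $cp > $maxCodePoint) {
        return false;
    }
    return $cp < 0xD800 || $cp > 0xDFFF;
}

function decode_valid_entities(string $s, int $maxCodePoint = 0x2FFFF): string
{
    // Group 1: hex digits, group 2: decimal digits, neither: named entity.
    $pattern = '/&(?:#[xX]([0-9A-Fa-f]+)|#([0-9]+)|[A-Za-z][A-Za-z0-9]*);/';

    $callback = function (array $m) use ($maxCodePoint): string {
        $hex = $m[1] ?? '';
        $dec = $m[2] ?? '';

        if ($hex === '' && $dec === '') {
            // Named entity: html_entity_decode returns unknown names unchanged.
            return html_entity_decode($m[0], ENT_QUOTES | ENT_HTML5, 'UTF-8');
        }

        // Strip leading zeros and reject over-long digit runs before converting.
        $digits = ltrim($hex !== '' ? $hex : $dec, '0');
        if ($digits === '' || strlen($digits) > 7) {
            return $m[0];
        }
        $cp = $hex !== '' ? (int) hexdec($digits) : (int) $digits;

        if (!is_acceptable_code_point($cp, $maxCodePoint)) {
            return $m[0]; // leave the rejected entity exactly as written
        }
        $char = mb_chr($cp, 'UTF-8');
        return $char === false ? $m[0] : $char;
    };

    // preg_replace_callback returns null on a regex error; fall back to the input.
    return preg_replace_callback($pattern, $callback, $s) ?? $s;
}

echo decode_valid_entities('&quot; &amp; &#x349; &#x99999;'), "\n";
```

Because each match is decoded on its own, a rejected entity is never rewritten by a later step, which avoids the repeated str_replace calls. Note that under this assumed rule &#x92; is decoded (to the C1 control character U+0092), so tighten is_acceptable_code_point if you want control characters left alone.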

answered Oct 16 '22 08:10 by carpii