
How to decode numeric HTML entities in PHP

I'm trying to decode an encoded long dash from its numeric entity to a string, but I can't find a function that does this properly.

The best I have found is mb_decode_numericentity(); however, for some reason it fails to decode the long dash and some other special characters.

$str = '&#x2013;';

$str = mb_decode_numericentity($str, array(0xFF, 0x2FFFF, 0, 0xFFFF), 'ISO-8859-1');

This will return "?".

Does anyone know how to solve this problem?

Yuriy asked May 04 '10 11:05



2 Answers

The following code snippet (mostly stolen from here and improved) will work for literal, numeric decimal, and numeric hexadecimal entities:

header("content-type: text/html; charset=utf-8");

/**
* Decodes all HTML entities, including numeric and hexadecimal ones.
* 
* @param mixed $string
* @return string decoded HTML
*/

function html_entity_decode_numeric($string, $quote_style = ENT_COMPAT, $charset = "utf-8")
{
    $string = html_entity_decode($string, $quote_style, $charset);
    $string = preg_replace_callback('~&#x([0-9a-fA-F]+);~i', "chr_utf8_callback", $string);
    // The /e modifier was removed in PHP 7, so decimal entities also go through a callback.
    $string = preg_replace_callback('~&#([0-9]+);~', function ($m) { return chr_utf8((int) $m[1]); }, $string);
    return $string;
}

/** 
 * Callback helper 
 */

function chr_utf8_callback($matches)
{
    return chr_utf8(hexdec($matches[1]));
}

/**
* Multi-byte chr(): Will turn a numeric argument into a UTF-8 string.
* 
* @param mixed $num
* @return string
*/

function chr_utf8($num)
{
    if ($num < 128) return chr($num);
    if ($num < 2048) return chr(($num >> 6) + 192) . chr(($num & 63) + 128);
    if ($num < 65536) return chr(($num >> 12) + 224) . chr((($num >> 6) & 63) + 128) . chr(($num & 63) + 128);
    if ($num < 2097152) return chr(($num >> 18) + 240) . chr((($num >> 12) & 63) + 128) . chr((($num >> 6) & 63) + 128) . chr(($num & 63) + 128);
    return '';
}


$string ="&#x201D;"; 

echo html_entity_decode_numeric($string);

Improvement suggestions are welcome.
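
To make the bit arithmetic in chr_utf8() easier to follow, here is a minimal sketch that walks the three-byte branch by hand for U+2013 (the en dash). The variable names are only illustrative and are not part of the snippet above.

```php
<?php
// Worked example of the three-byte UTF-8 layout for U+2013 (EN DASH).
$num = 0x2013; // 8211 in decimal

$byte1 = ($num >> 12) + 0xE0;         // lead byte 1110xxxx: top 4 bits
$byte2 = (($num >> 6) & 0x3F) + 0x80; // continuation 10xxxxxx: middle 6 bits
$byte3 = ($num & 0x3F) + 0x80;        // continuation 10xxxxxx: low 6 bits

$utf8 = chr($byte1) . chr($byte2) . chr($byte3);

echo bin2hex($utf8), "\n"; // e28093
```

The constants 0xE0, 0x80 and 0x3F are the same values as 224, 128 and 63 in chr_utf8(), written in hex so the bit patterns are easier to see.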

Pekka answered Sep 29 '22 14:09


mb_decode_numericentity does not handle hexadecimal, only decimal. Do you get the expected result with:

$str = '&#8211;';

$str = mb_decode_numericentity($str, array(255, 3145727, 0, 65535), 'ISO-8859-1');

You can use hexdec to convert your hexadecimal to decimal.
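
As a rough sketch of the hexdec idea, the following rewrites hexadecimal entities as decimal ones before decoding. The helper name hex_entities_to_decimal is hypothetical, and the convmap and the 'UTF-8' target are assumptions made for this example: ISO-8859-1 has no character for the en dash, so a UTF-8 target is used here instead. The mbstring call is guarded in case the extension is not loaded.

```php
<?php
// Hypothetical helper (not from the answer): rewrite &#xHHHH; as &#DDDD;.
function hex_entities_to_decimal(string $text): string
{
    return preg_replace_callback(
        '~&#x([0-9a-f]+);~i',
        function (array $m): string {
            // hexdec() turns the captured hex digits into a number.
            return '&#' . hexdec($m[1]) . ';';
        },
        $text
    );
}

$decimal = hex_entities_to_decimal('&#x2013;'); // '&#8211;'

// Assumed convmap: decode code points U+0080..U+10FFFF with no offset.
$decoded = function_exists('mb_decode_numericentity')
    ? mb_decode_numericentity($decimal, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8')
    : null; // mbstring extension not loaded
```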

Also, out of curiosity, does the following work:

$str = '&#8211;';

 $str = html_entity_decode($str);
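
For comparison, here is a minimal sketch of the same call with the flags and target charset passed explicitly instead of relying on the defaults, which differ between PHP versions. The choice of ENT_QUOTES | ENT_HTML5 and 'UTF-8' is an assumption for this example.

```php
<?php
// Flags and charset are spelled out; both are assumptions for this sketch.
$flags = ENT_QUOTES | ENT_HTML5;

$fromDecimal = html_entity_decode('&#8211;', $flags, 'UTF-8');
$fromHex     = html_entity_decode('&#x2013;', $flags, 'UTF-8');
$fromNamed   = html_entity_decode('&ndash;', $flags, 'UTF-8');

// All three forms yield the same UTF-8 bytes for the en dash.
echo bin2hex($fromDecimal), "\n"; // e28093
```

With a single-byte target charset such as ISO-8859-1, the en dash has no representation, so the entity cannot come out as a real character there.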

Anthony answered Sep 29 '22 14:09