
Better explanation of $convmap in mb_encode_numericentity()

The PHP manual's description of the convmap parameter of mb_encode_numericentity() is vague to me. Could somebody give a better explanation of it, or "dumb it down" for me? What do the array elements used in this parameter mean? Example 1 on the manual page has

<?php
$convmap = array (
 int start_code1, int end_code1, int offset1, int mask1,
 int start_code2, int end_code2, int offset2, int mask2,
 ........
 int start_codeN, int end_codeN, int offsetN, int maskN );
// Specify Unicode value for start_codeN and end_codeN
// Add offsetN to value and take bit-wise 'AND' with maskN, then
// it converts value to numeric string reference.
?>

which is helpful, but then I see a lot of usage examples like array(0x80, 0xffff, 0, 0xffff); which throw me off. Does that mean the offset is 0 and the mask is 0xffff? If so, does offset mean the number of characters into the string at which conversion starts, and what does mask mean in this context?

Nick Rolando asked Mar 07 '16 21:03


1 Answer

Looking down the rabbit hole, it appears that the comments in the documentation for mb_encode_numericentity are accurate, though somewhat cryptic.

The four major parts to the convmap appear to be:

start_code: The map affects items starting from this character code.
end_code: The map affects items up to this character code.
offset: A positive or negative amount that is added to the character code before the mask is applied.
mask: The value that is bitwise ANDed with the result (character code plus offset) to produce the number written in the entity.
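
Read together, the four values describe a per-character rule: if a character code lies between start_code and end_code, the entity number is (character code + offset) & mask. A minimal sketch of that rule in plain PHP follows. The function name sketch_encode is hypothetical, it assumes the mbstring extension, PHP 7.4 or later (for mb_str_split) and a convmap whose length is a multiple of four, and it only illustrates the documented rule rather than showing how the extension itself is implemented.

```php
<?php
// Hypothetical helper: emulates the documented convmap rule for UTF-8 input.
function sketch_encode(string $str, array $convmap): string
{
    $out = '';
    foreach (mb_str_split($str, 1, 'UTF-8') as $char) {
        $code = mb_ord($char, 'UTF-8');
        $replacement = $char; // characters outside every range pass through

        // Each group of four values is one rule: start, end, offset, mask.
        foreach (array_chunk($convmap, 4) as [$start, $end, $offset, $mask]) {
            if ($code >= $start && $code <= $end) {
                $replacement = '&#' . (($code + $offset) & $mask) . ';';
                break; // this sketch stops at the first range that matches
            }
        }
        $out .= $replacement;
    }
    return $out;
}

echo sketch_encode("¢", [0x00, 0xff, 0, 0xff]), "\n"; // &#162;
```

With an offset of 0 and a mask that keeps every bit, the entity number is simply the character code, which is what the plain example further down shows.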

Character codes can be visualized via character tables such as this Codepage Layout example for ISO-8859-1 encoding. (ISO-8859-1 is the encoding used in the original PHP documentation Example #2.) Looking at this encoding table, we can see that the convmap in that example only affects character codes from 0x80 (which appears to be blank for this particular encoding) through 0xff (ÿ), the final character in this encoding. Characters outside that range, such as plain ASCII letters, are left as they are.
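
As a quick check of the range behaviour, the sketch below runs the narrow convmap array(0x80, 0xff, 0, 0xff) next to the wider array(0x80, 0xffff, 0, 0xffff) that the question mentions. The sample string is an assumption chosen for illustration, and the source file is assumed to be saved as UTF-8.

```php
<?php
$str = "café costs 5€"; // é is U+00E9 (233), € is U+20AC (8364)

// Range 0x80..0xff: é is inside the range, € is above it.
$narrow = mb_encode_numericentity($str, [0x80, 0xff, 0, 0xff], "UTF-8");

// Range 0x80..0xffff: both é and € are inside the range.
$wide = mb_encode_numericentity($str, [0x80, 0xffff, 0, 0xffff], "UTF-8");

echo $narrow, "\n"; // caf&#233; costs 5€
echo $wide, "\n";   // caf&#233; costs 5&#8364;
```

In both maps the offset is 0 and the mask keeps every bit the range can produce, so the entity number equals the character code and the ASCII text is untouched.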

To better understand the offset and mask features of convmap, here are some examples of how offset and mask affect character codes. In all of the examples below, the character is ¢, which has a character code of 162:

Plain Example:

<?php
$original_str = "¢";
$convmap = array(0x00, 0xff, 0, 0xff);
$converted_str = mb_encode_numericentity($original_str, $convmap, "UTF-8");
echo "original:  $original_str\n";
echo "converted: $converted_str\n";
?>

Result:

original:  ¢
converted: &#162;

Offset Example:

<?php
$original_str = "¢";
$convmap = array(0x00, 0xff, 1, 0xff);
$converted_str = mb_encode_numericentity($original_str, $convmap, "UTF-8");
echo "original:  $original_str\n";
echo "converted: $converted_str\n";
?>

Result:

original:  ¢
converted: &#163;

Notes:

The offset applies only to the start_code to end_code range it is listed with, so each group of four values in convmap can carry its own offset. For example, you might have some particular reason to add an offset for one range of character codes in your convmap, while another range in the same convmap uses an offset of 0 and leaves its character codes unchanged.
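
To illustrate that point, the sketch below uses a convmap with two rules, one with no offset and one with an offset of 1. The ranges and the sample string are assumptions made up for this example. It also passes the result back through mb_decode_numericentity() with the same convmap, which should subtract the offset again and restore the original string.

```php
<?php
$convmap = [
    0x80,  0xff,   0, 0xffff, // first range: entity number = character code
    0x100, 0xffff, 1, 0xffff, // second range: entity number = character code + 1
];

$original = "¢€"; // ¢ is 162 (first range), € is 8364 (second range)

$encoded = mb_encode_numericentity($original, $convmap, "UTF-8");
echo $encoded, "\n"; // &#162;&#8365;

// Decoding with the same convmap is expected to undo the offset.
$decoded = mb_decode_numericentity($encoded, $convmap, "UTF-8");
var_dump($decoded === $original);
```

Note that an entity produced with a non-zero offset no longer names the original character, so this is only useful when the same convmap is used to decode it later.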


Mask Example:

<?php
// Mask Example 1
$original_str = "¢";
$convmap = array(0x00, 0xff, 0, 0xf0);
$converted_str = mb_encode_numericentity($original_str, $convmap, "UTF-8");
echo "original:  $original_str\n";
echo "converted: $converted_str\n\n";

// Mask Example 2
$convmap = array(0x00, 0xff, 0, 0x0f);
$converted_str = mb_encode_numericentity($original_str, $convmap, "UTF-8");
echo "original:  $original_str\n";
echo "converted: $converted_str\n\n";

// Mask Example 3
$convmap = array(0x00, 0xff, 0, 0x00);
$converted_str = mb_encode_numericentity($original_str, $convmap, "UTF-8");
echo "original:  $original_str\n";
echo "converted: $converted_str\n";
?>

Result:

original:  ¢
converted: &#160;

original:  ¢
converted: &#2;

original:  ¢
converted: &#0;

Notes:

This answer does not intend to cover masking in great detail, but masking can help keep or remove certain bits from a given value.

Mask Example 1

In the first mask example, the mask is 0xf0. The hex digit f has a binary value of 1111 and 0 has a binary value of 0000, so together the mask is 11110000. The 1 bits mark the positions we want to keep, which here are the four bits on the left side of the binary value.

Then, when we do a bitwise AND operation with our character code (in this case, 162, which has a binary value of 10100010) the bitwise operation looks like this:

  11110000
& 10100010
----------
  10100000

And when converted back to its decimal value, 10100000 is 160.

Therefore, we've effectively kept the "left side" of the bits from the original character code value, and have gotten rid of the "right side" of the bits.

Mask Example 2

In the second mask example, the mask 0x0f (which has a binary value of 00001111) in the bitwise AND operation would have the following binary result:

  00001111
& 10100010
----------
  00000010

Which, when converted back to its decimal value, is 2.

Therefore, we've effectively kept the "right side" of the bits from the original character code value, and have gotten rid of the "left side" of the bits.

Mask Example 3

Finally, the third mask example shows what happens when using a mask of 0x00 (which is 00000000 in binary) in the bitwise AND operation:

  00000000
& 10100010
----------
  00000000

Which results in 0.
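
The three mask results can also be reproduced with PHP's bitwise AND operator directly, without calling mb_encode_numericentity(). This is only a sketch of the arithmetic. The variable names are chosen for illustration, and the extra 0xffff mask is added here to cover the value that shows up in many usage examples.

```php
<?php
$code = 162; // character code of ¢, binary 10100010

$results = [];
foreach ([0xf0, 0x0f, 0x00, 0xffff] as $mask) {
    $results[$mask] = $code & $mask;
    // Print the mask, the 8-bit binary result, and its decimal value.
    printf("mask 0x%04x: %08b = %d\n", $mask, $results[$mask], $results[$mask]);
}
// mask 0x00f0: 10100000 = 160
// mask 0x000f: 00000010 = 2
// mask 0x0000: 00000000 = 0
// mask 0xffff: 10100010 = 162
```

The last line shows why 0xffff is such a common mask: it keeps all sixteen low bits, so any character code up to 0xffff comes through unchanged.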

summea answered Oct 21 '22 02:10