 

What does PHP's mb_internal_encoding actually do?

Tags: string, php

According to the PHP website it does this:

encoding is the character encoding name used for the HTTP input character encoding conversion, HTTP output character encoding conversion, and the default character encoding for string functions defined by the mbstring module. You should notice that the internal encoding is totally different from the one for multibyte regex.

Can someone please explain this in simpler terms?

  1. HTTP input character encoding conversion
  2. HTTP output character encoding conversion
  3. default character encoding for string functions
  4. What is meant by “internal encoding is totally different from the one for multibyte regex”?

My guess is that

  1. means GET and POST are treated as that encoding.
  2. means it outputs to that encoding.
  3. means it uses that encoding for all multibyte string functions.
  4. I have no idea about. Why would regex be different to normal string functions?

If point 2 is correct would you need to do:

ini_set('default_charset', 'UTF-8');

If I understand 3 correctly does that mean if you do:

mb_internal_encoding('UTF-8');

You don't need to do:

mb_strtolower($str, 'UTF-8');

Just:

mb_strtolower($str);

I did read on another SO post that mb_strtolower($str) should not be trusted and that you need to set the encoding for each multibyte string function. Is this true?

texelate asked Mar 26 '14 06:03


1 Answer

The mbstring extension added the glorious idea (</sarcasm>) of automatically converting all incoming and all outgoing data from one encoding to another. See "HTTP Input and Output" in the mbstring documentation. The conversion is configured with the mbstring.http_input ini setting and by using mb_output_handler, and mb_internal_encoding influences it.

IMO you should leave those settings off and never touch them. I have yet to find a problem that they solve elegantly, and having implicit encoding conversions going on sounds like a terrible idea overall, especially when it is all controlled by one global flag (mb_internal_encoding) that is used in a variety of different contexts.

So that's 1. and 2.

For 3., yes indeed: mb_internal_encoding basically sets the default value for all mb_ functions that accept an $encoding parameter. Essentially it just sets a global variable (internally) that other functions read from; that's all.
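To make the "default value" point concrete, here is a minimal sketch. It assumes PHP 7 or later with the mbstring extension loaded; the sample string and variable names are only for illustration.

```php
<?php
// Sketch only: assumes PHP 7+ with the mbstring extension loaded.
mb_internal_encoding('UTF-8');

// "ÄRGER", written with an escape so the encoding of this file doesn't matter.
$word = "\u{00C4}RGER";

// No $encoding argument: the internal encoding (UTF-8 here) is used.
$implicit = mb_strlen($word);

// An explicit $encoding argument overrides the global default for this call only.
$asUtf8  = mb_strlen($word, 'UTF-8');
$asLatin = mb_strlen($word, 'ISO-8859-1');

var_dump($implicit === $asUtf8); // bool(true)
var_dump($asUtf8, $asLatin);     // int(5) and int(6): "Ä" is two bytes in UTF-8
```

Omitting the argument and passing 'UTF-8' give the same result only because the global happens to be 'UTF-8' at that moment.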

The last part refers to the fact that there's a separate mb_regex_encoding function to set the internal encoding for mb_ereg_ functions.
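A small sketch of the two settings side by side. It assumes mbstring was built with multibyte regex support (the mb_ereg_ functions are not available otherwise), and the internal encoding is deliberately set to something different just to show that the regex functions read their own setting.

```php
<?php
// Sketch only: assumes mbstring with multibyte regex (mbregex) support.
mb_internal_encoding('ISO-8859-1'); // deliberately different, for illustration
mb_regex_encoding('UTF-8');         // what the mb_ereg_* functions will use

$word = "\u{00C4}b"; // "Äb": 2 characters, 3 bytes in UTF-8

// "." matches one character in the regex encoding, so each character becomes "x".
$masked = mb_ereg_replace('.', 'x', $word);

var_dump(mb_internal_encoding()); // string(10) "ISO-8859-1"
var_dump(mb_regex_encoding());    // string(5) "UTF-8"
var_dump($masked);                // string(2) "xx"
```

If you use both families of functions on UTF-8 data, you would normally set both to 'UTF-8'.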

I did read on another SO post that mb_strtolower($str) should not be trusted and that you need to set the encoding for each multibyte string function. Is this true?

I'd agree with this insofar as no global state can be trusted. This is pretty trustworthy:

mb_internal_encoding('UTF-8');
mb_strtolower($string);

However, this is not really:

mb_strtolower($string);

See the difference? If you rely on global state being set correctly elsewhere, you can never be sure it actually is correct. All it takes is a call to some third-party library that sets mb_internal_encoding to something else without your knowing, and your mb_strtolower call will suddenly behave very differently.
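Here is a rough sketch of that failure mode and one way around it. The helper lower_utf8() is hypothetical (it is not part of mbstring), and the mb_internal_encoding('ISO-8859-1') call merely stands in for whatever a third-party library might do.

```php
<?php
// Sketch only: lower_utf8() is a hypothetical helper, not part of mbstring.
function lower_utf8(string $s): string
{
    // The encoding is stated here, so the global setting is irrelevant.
    return mb_strtolower($s, 'UTF-8');
}

$word = "\u{00C4}RGER"; // "ÄRGER"

// Simulate a third-party library changing the global behind your back.
mb_internal_encoding('ISO-8859-1');

$implicit = mb_strtolower($word); // bytes are interpreted as ISO-8859-1
$explicit = lower_utf8($word);    // bytes are interpreted as UTF-8

var_dump($explicit === "\u{00E4}rger"); // bool(true): "ärger"
var_dump($implicit === $explicit);      // bool(false)
```

Passing the encoding on every call is more typing, but the result no longer depends on what any other code did to the global.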

deceze answered Oct 02 '22 18:10