PHP Multibyte String Functions

Today I ran into a problem with the PHP function strpos(): it returned FALSE even though the correct result was obviously 0. The cause was that one parameter was encoded in UTF-8, while the other (which originates from an HTTP GET parameter) obviously was not.

I then noticed that using the mb_strpos function solved my problem.

My question now is: Is it wise to use the PHP multibyte string functions in general to avoid these problems in the future? Should I avoid the traditional strpos, strlen, ereg, etc. functions altogether?

Note: I don't want to set mbstring.func_overload globally in php.ini, because this leads to other problems when using the PEAR library. I am using PHP 4.

prinzdezibel asked Dec 23 '22 12:12


1 Answer

It depends on the character encoding you are using. In a single-byte character encoding, or in UTF-8 (where a single byte inside a character can never be mistaken for another character), you can continue to use the regular string search functions, as long as the string you are searching in and the string you are searching for are in the same encoding.
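As a rough illustration of the UTF-8 case, here is a minimal sketch. The sample strings are made up for the example, and it assumes the mbstring extension is loaded. It shows that strpos finds the right match in UTF-8, with one caveat: the offset it reports is counted in bytes, while mb_strpos counts characters.

```php
<?php
// Both strings are UTF-8, written as explicit bytes so the example does not
// depend on the encoding of this source file.
$haystack = "na\xC3\xAFve caf\xC3\xA9"; // "naïve café"
$needle   = "caf\xC3\xA9";              // "café"

// Byte-wise search: the match is correct, the offset is in bytes.
$bytePos = strpos($haystack, $needle);

// Character-wise search: same match, offset is in characters.
$charPos = mb_strpos($haystack, $needle, 0, 'UTF-8');

var_dump($bytePos); // int(7), because "ï" occupies two bytes
var_dump($charPos); // int(6)
```

If you only need to know whether the needle occurs, the two functions agree here. The difference only matters if you go on to use the offset, for example in substr versus mb_substr.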

If you are using a multi-byte encoding other than UTF-8, one that does not prevent single bytes within a character from looking like other characters, then it is never safe to search with the regular string search functions: you may get false positives. PHP's string comparison in functions such as strpos works byte by byte. With the exception of UTF-8, which is specifically designed to prevent this problem, multi-byte encodings allow any subsequent byte of a multi-byte character to match part of a different character.
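A well-known instance of this is Shift_JIS, where the second byte of some characters is 0x5C, the same byte as an ASCII backslash. The sketch below is only an illustration with a hand-picked character, and it assumes the mbstring extension is loaded and supports the 'SJIS' encoding name.

```php
<?php
// The single Shift_JIS character "表" is the two bytes 0x95 0x5C.
$haystack = "\x95\x5C";
// An ASCII backslash is the single byte 0x5C.
$needle = "\\";

// Byte-wise search: reports a backslash at byte offset 1, although the
// string contains no backslash character at all (a false positive).
$bytePos = strpos($haystack, $needle);

// Encoding-aware search: decodes the haystack as Shift_JIS first,
// so the trailing byte of "表" is not treated as a separate character.
$charPos = mb_strpos($haystack, $needle, 0, 'SJIS');

var_dump($bytePos); // int(1)
var_dump($charPos); // bool(false)
```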

If the string you are searching in and the string you are searching for are in different character encodings, then conversion is always necessary. Otherwise, a search for any string that is represented differently in the other encoding will always return false. Do this conversion on input: decide on one character encoding for your app and use it consistently within the application. Any time you receive input in a different encoding, convert it on the way in.
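A minimal sketch of converting on the way in might look like the following. It assumes the incoming value is known to be ISO-8859-1 and that the application has settled on UTF-8; both choices, and the variable names, are assumptions for the example, since the question does not say which encoding the GET parameter actually used.

```php
<?php
// Application data, stored as UTF-8: "été" as explicit bytes.
$haystack = "\xC3\xA9t\xC3\xA9";

// Hypothetical input value, standing in for a GET parameter that arrived
// as ISO-8859-1: "é" is the single byte 0xE9 in that encoding.
$rawInput = "\xE9";

// Without conversion the byte 0xE9 never occurs in the UTF-8 haystack.
$before = strpos($haystack, $rawInput);

// Convert once at the boundary: mb_convert_encoding(string, to, from).
$input = mb_convert_encoding($rawInput, 'UTF-8', 'ISO-8859-1');

// Now both strings share an encoding and the plain search works.
$after = strpos($haystack, $input);

var_dump($before); // bool(false)
var_dump($after);  // int(0)
```

Note the strict comparison you need when checking the result: a match at offset 0 and "not found" (false) are only distinguishable with === or !==.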

thomasrutter answered Jan 08 '23 00:01