Is mb_* necessary to replace single-byte characters from a multibyte string?

Tags: php, utf-8

Let's say I have a UTF-8 text like this:

âàêíóôõ <br> âàêíóôõ <br> âàêíóôõ

I want to replace <br> with <br />. Do I need to use mb_str_replace, or can I use str_replace?

Does it matter that <, b, r, / and > are all single-byte characters?

dynamic asked Feb 06 '12 19:02




2 Answers

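Since str_replace is binary-safe and UTF-8 is a bijective encoding (every character has exactly one valid byte sequence, and that sequence never appears inside another character's bytes), you can use str_replace even if the search string or the replacement contains multibyte characters, as long as all three parameters are encoded as UTF-8.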

That's why there isn't an mb_str_replace function in the first place.
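As a minimal sketch of the point above, the snippet below runs plain str_replace over the text from the question. It assumes the script file itself is saved as UTF-8; the variable names are only illustrative.

```php
<?php
// UTF-8 text with multibyte characters around an ASCII-only tag.
$text = "âàêíóôõ <br> âàêíóôõ <br> âàêíóôõ";

// str_replace compares raw bytes. The ASCII bytes of "<br>" never occur
// inside a UTF-8 multibyte sequence, so no character can be split.
$result = str_replace("<br>", "<br />", $text);

// A multibyte search string works the same way, because both the needle
// and the haystack are UTF-8.
$ascii = str_replace("ê", "e", $text);

echo $result, PHP_EOL; // âàêíóôõ <br /> âàêíóôõ <br /> âàêíóôõ
echo $ascii, PHP_EOL;  // âàeíóôõ <br> âàeíóôõ <br> âàeíóôõ
```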

If your encoding is not bijective, i.e. the same string can have multiple representations (for example, < in UTF-7 can be written as either '+ADw-' or '<'), you should convert all strings to the same bijective encoding, apply str_replace, and then convert the result to the target encoding.
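The convert, replace, convert-back sequence could look roughly like the sketch below. It assumes the mbstring extension is loaded and uses its "UTF-7" encoding name; the sample input string and variable names are made up for illustration.

```php
<?php
// Assumption: the mbstring extension is available.
// UTF-7 input in which "<" and ">" are written in their encoded forms.
$utf7 = "a +ADw-br+AD4- b";

// A byte-level search for "<br>" finds nothing in the raw UTF-7 bytes.
$naive = str_replace("<br>", "<br />", $utf7);

// 1. Convert to UTF-8, 2. replace, 3. convert to the target encoding.
$utf8       = mb_convert_encoding($utf7, "UTF-8", "UTF-7");
$replaced   = str_replace("<br>", "<br />", $utf8);
$backToUtf7 = mb_convert_encoding($replaced, "UTF-7", "UTF-8");

var_dump($naive === $utf7); // true: the naive call changed nothing
echo $replaced, PHP_EOL;
echo $backToUtf7, PHP_EOL;
```

The exact UTF-7 spelling of the final string depends on the encoder, which is the non-bijective behaviour the answer describes, so compare such strings only after decoding them.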

phihag answered Sep 28 '22 02:09


Reference for manipulating UTF-8 strings safely in PHP. There is no hard-and-fast rule. Some native PHP string functions can operate safely on UTF-8, some can with care, and some cannot.

There is no mb_str_replace(). Notice the section "UTF-8 Safe Functionality": explode() and str_replace() are safe as long as all of their string arguments are valid UTF-8.
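To illustrate the difference between functions that are safe and functions that need care, here is a small sketch. It assumes PHP 7 or later (for the "\u{...}" escape) and the mbstring extension (for mb_strlen); the sample words are arbitrary.

```php
<?php
$word = "caf\u{00E9}"; // "café"; U+00E9 takes two bytes in UTF-8

// Needs care: strlen counts bytes, mb_strlen counts characters.
$bytes = strlen($word);             // 5
$chars = mb_strlen($word, "UTF-8"); // 4

// Safe: explode splits on an exact byte sequence, which cannot
// start or end in the middle of a valid UTF-8 character.
$parts = explode(" ", "caf\u{00E9} na\u{00EF}ve");

var_dump($bytes, $chars);
print_r($parts); // two elements: "café" and "naïve"
```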

Francis Avila answered Sep 28 '22 03:09