Can str_replace be safely used on a UTF-8 encoded string if it's only given valid UTF-8 encoded strings as arguments?

Tags: php, utf-8

PHP's str_replace() was intended only for ANSI strings and as such can mangle UTF-8 strings. However, given that it's binary-safe, would it work properly if it were only given valid UTF-8 strings as arguments?

Edit: I'm not looking for a replacement function, I would just like to know if this hypothesis is correct.

Manos Dilaverakis asked Apr 16 '10 10:04



2 Answers

Yes. UTF-8 is deliberately designed to allow this and other similar non-Unicode-aware processing.

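In UTF-8, the byte sequence for any non-ASCII character begins with a lead byte in the range \xC0-\xFF (valid UTF-8 only uses \xC2-\xF4), followed by one to three continuation bytes in the range \x80-\xBF. A lead byte may not appear anywhere else in the sequence, and ASCII bytes (\x00-\x7F) never appear inside a multibyte sequence, so a valid UTF-8 search string can only match at a character boundary, never partway through a character.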

This is not the case for older multibyte encodings, where different parts of a byte sequence are indistinguishable. This caused a lot of problems, for example trying to replace an ASCII backslash in a Shift-JIS string (where byte \x5C might be the second byte of a character sequence representing something else).
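As a rough illustration of the points above (the sample strings are my own, not from the question), here is a minimal sketch using only str_replace() and bin2hex(). It assumes the byte strings are written out with \x escapes so the result does not depend on the source file's encoding. The last two cases show what goes wrong once the "valid UTF-8 arguments" prerequisite is dropped, or when the encoding is Shift-JIS instead.

```php
<?php
// Valid UTF-8 haystack and needle: only whole characters can match.
$haystack = "na\xC3\xAFve caf\xC3\xA9";              // "naïve café"
$replaced = str_replace("\xC3\xA9", "e", $haystack); // replace "é" (C3 A9)

// "©" is C2 A9 and shares its last byte with "é", but its lead byte differs,
// so there is no partial match and the haystack comes back unchanged.
$unchanged = str_replace("\xC2\xA9", "(c)", $haystack);

// An invalid needle (a lone continuation byte) breaks the guarantee:
// it matches inside "é" and leaves a dangling lead byte C3 behind.
$broken = str_replace("\xA9", "?", $haystack);

// Shift-JIS for contrast: "ソ" is 83 5C, and 5C is also the ASCII backslash.
$sjis    = "\x83\x5C";
$mangled = str_replace("\\", "/", $sjis);            // now 83 2F, no longer "ソ"

echo bin2hex($replaced), "\n", bin2hex($broken), "\n", bin2hex($mangled), "\n";
```

Only the first two calls satisfy the condition in the question (every argument is valid UTF-8), and those are the ones that behave.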

bobince answered Sep 22 '22 02:09


It's correct because every byte of a UTF-8 multibyte character is non-ASCII (byte value 128 or higher), and the first byte defines how many bytes follow, so you can't accidentally end up matching part of one UTF-8 multibyte character with another.

To visualise (abstractly):

  • a for an ASCII character
  • 2x for a 2-byte character
  • 3xx for a 3-byte character
  • 4xxx for a 4-byte character

If you're matching, say, a2x3xx (where a is a byte in the ASCII range): since a < x, and 2x cannot be a substring of 3xx or 4xxx, et cetera (the lead bytes 2, 3 and 4 are distinct from each other and from the continuation bytes x), you can be sure that your UTF-8 will match correctly, given the prerequisite that all strings are definitely valid UTF-8.
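To make the notation above a bit more concrete, here is a small sketch that maps each byte of a string to a, 2, 3, 4 or x. The helper name utf8_shape() is hypothetical (it is not a PHP built-in), and it classifies bytes purely by range, so it assumes its input is already valid UTF-8 and does no validation.

```php
<?php
// Hypothetical helper: render each byte in the a / 2x / 3xx / 4xxx notation.
// Assumes $s is valid UTF-8; bytes are classified by range only.
function utf8_shape(string $s): string
{
    $shape = '';
    for ($i = 0, $n = strlen($s); $i < $n; $i++) {
        $b = ord($s[$i]);
        if ($b < 0x80) {
            $shape .= 'a';      // ASCII byte
        } elseif ($b < 0xC0) {
            $shape .= 'x';      // continuation byte (80-BF)
        } elseif ($b < 0xE0) {
            $shape .= '2';      // lead byte of a 2-byte character
        } elseif ($b < 0xF0) {
            $shape .= '3';      // lead byte of a 3-byte character
        } else {
            $shape .= '4';      // lead byte of a 4-byte character
        }
    }
    return $shape;
}

$haystack = "a\xC3\xA9\xE2\x82\xAC";   // "a", "é" (C3 A9), "€" (E2 82 AC)
$needle   = "\xE2\x82\xAC";            // "€"

echo utf8_shape($haystack), "\n";      // a2x3xx
echo utf8_shape($needle), "\n";        // 3xx

// A byte-wise search for a valid needle lands on a lead byte, never on an x.
$pos = strpos($haystack, $needle);
echo $pos, ' ', utf8_shape($haystack)[$pos], "\n";
```

Because a needle's shape always starts with a, 2, 3 or 4, and those never occur in an x position of the haystack, a byte-wise match has to start on a character boundary.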

Edit: See bobince's answer for a less abstract explanation.

pinkgothic answered Sep 24 '22 02:09