
Bug in PHP Preg engine: look-around Unicode issue

Tags: regex, php

Why does the following JS code work:

"آرد@".replace(/(?=.)/g,'!'); // returns: ""!آ!ر!د""

But its PHP equivalent returns '!�!�!�!�!�!�'?

preg_replace('/(?=.)/u', '!', 'آرد'); //returns '!�!�!�!�!�!�'

This works only in PHP versions 4.3.5 to 5.0.5 and 5.1.1 to 5.1.6.

See: http://3v4l.org/jrV0W

Handsome Nerd asked Feb 18 '13 07:02




1 Answer

If you simply add the /u modifier, the pattern and subject are supposed to be treated as UTF-8. The second example works because:

  1. Since PHP 5.1, you can use \p{L} that can be translated as: "is any kind of letter from any language."
  2. In addition to the standard notation, \p{L}, Java, Perl, PCRE and now PHP allow you to use the shorthand \pL. The shorthand only works with single-letter Unicode properties.
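To illustrate the \p{L} property from the list above, here is a minimal sketch in JavaScript, the language of the question's first snippet. It assumes an ES2018+ engine, where property escapes need the u flag and the braces form; the \pL shorthand from item 2 is a PCRE/Perl/Java feature and is not accepted by JavaScript. The helper name markLetters is made up for this example and is not part of the original answer.

```javascript
// Hypothetical helper: insert "!" before every character that Unicode
// classifies as a letter (general category L).
function markLetters(text) {
  // \p{L} matches any letter in any script. The "u" flag makes the engine
  // work on code points and enables property escapes.
  return text.replace(/(?=\p{L})/gu, '!');
}

// "\u0622\u0631\u062F" is the Arabic word from the question, written with
// escapes so the source file's encoding cannot interfere.
console.log(markLetters('\u0622\u0631\u062F')); // one "!" before each letter
console.log(markLetters('a1 \u0628'));          // digits and spaces are skipped
```

The lookahead is zero-width, so the letters themselves are left untouched and only the marker is inserted.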

UPDATE: Why does preg_replace('/(?=.)/u', '!', 'آرد'); return '!�!�!�!�!�!�'?

As @MarkFox says, the reason is that in this context preg_replace() assumes one byte per character, while the characters you are matching are multibyte. That is why the output has twice as many matches as you would expect: the pattern matches at each byte of each character (which I infer to be two bytes).
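The byte-per-character explanation can be reproduced outside PHP. The sketch below is only a simulation in JavaScript: it does not call PHP or PCRE, and the helper name bangBeforeEachByte is hypothetical. It assumes a runtime that provides TextEncoder and TextDecoder (modern browsers and Node.js do).

```javascript
// The word from the question, written with escapes: three Arabic letters.
const subject = '\u0622\u0631\u062F';

// UTF-8 encodes each of these letters as two bytes, so 3 letters = 6 bytes.
const bytes = new TextEncoder().encode(subject);
console.log([...subject].length, bytes.length); // code points vs. bytes

// Hypothetical helper: put "!" (0x21) before every byte, which is what a
// pattern like (?=.) does if "." means "one byte".
function bangBeforeEachByte(input) {
  const out = [];
  for (const b of input) out.push(0x21, b);
  return Uint8Array.from(out);
}

// Every two-byte sequence is now split apart, so a UTF-8 decoder turns each
// orphaned byte into U+FFFD, the replacement character.
const broken = new TextDecoder('utf-8').decode(bangBeforeEachByte(bytes));
console.log(broken);
```

Six bytes give six insertion points, which matches the six "!" markers in the output shown in the question.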

No matter what you do with your document encoding, you will need to use Unicode character properties to get this working.

What about that weird symbol?

When you see that "weird square symbol with a question mark inside" otherwise known as the REPLACEMENT CHARACTER, that is usually an indicator that you have a byte in the range of 80-FF (128-255) and the system is trying to render it in UTF-8.

That entire byte range is invalid as a single-byte character in UTF-8, but those bytes are all very common in Western encodings such as ISO-8859-1.
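A small JavaScript sketch of that effect, assuming a runtime with TextDecoder: the sample bytes are made up for illustration and spell "café" in ISO-8859-1, where 0xE9 is "é".

```javascript
// Illustrative input (not from the question): "café" in ISO-8859-1.
// The last byte, 0xE9, lies in the 0x80-0xFF range discussed above.
const latin1Bytes = Uint8Array.of(0x63, 0x61, 0x66, 0xE9);

// Decoded as UTF-8, the lone 0xE9 is not a complete character, so the
// decoder substitutes U+FFFD.
const asUtf8 = new TextDecoder('utf-8').decode(latin1Bytes);

// ISO-8859-1 maps each byte to the code point with the same number, so
// String.fromCharCode is enough to decode it in this sketch.
const asLatin1 = String.fromCharCode(...latin1Bytes);

console.log(asUtf8);   // "caf" followed by the replacement character
console.log(asLatin1); // "café"
```

The same bytes are fine under one encoding and broken under the other, which is why the symbol usually points to an encoding mismatch rather than to corrupt data.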

Tom Sarduy answered Sep 27 '22 23:09
