Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

preg_match and (non-English) Latin characters?

I have a XHTML form where I ask people to enter their full name. I then match that with preg_match() using this pattern: /^[\p{L}\s]+$/

On my local server running PHP 5.2.13 (PCRE 7.9 2009-04-11) this works fine. On the webhost running PHP 5.2.10 (PCRE 7.3 2007-08-28) it doesn't match when the entered string contains the Danish Latin character ø ( http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=%F8&mode=char ).

Is this a bug? Is there a work around?

Thank you in advance!

like image 217
Jonas Delfs Avatar asked Mar 24 '11 19:03

Jonas Delfs


People also ask

What does Preg_match mean in PHP?

Definition and Usage The preg_match() function returns whether a match was found in a string.

Which of the following call to Preg_match will return false?

preg_match() returns 1 if the pattern matches given subject , 0 if it does not, or false on failure. This function may return Boolean false , but may also return a non-Boolean value which evaluates to false .

Which of the following call to Preg_match will return true?

The preg_match() function returns true if pattern matches otherwise, it returns false.

How do I match a string in PHP?

The strcmp() function compares two strings. Note: The strcmp() function is binary-safe and case-sensitive. Tip: This function is similar to the strncmp() function, with the difference that you can specify the number of characters from each string to be used in the comparison with strncmp().


1 Answers

So, the problem is as presumed. You are not using the /u modifier. This means that PCRE will not look for UTF-8 characters.

In any case, this is how it should be done:

var_dump(preg_match('/^[\p{L}\s]+$/u', "ø")); 

And works on all my versions. There might be a bug in others, but that's not likely here.

Your problem is that this also works:

var_dump(preg_match('/^[\p{L}\s]+$/', utf8_decode("ø")));

Notice that this uses ISO-8859-1 instead of UTF-8, and leaves out the /u modifier. The result is int(1). Obviously PCRE interprets the Latin-1 ø as matching \p{L} when in non-/unicode mode. (Most of the single-byte \xA0-\xFF are letter symbols in Latin-1, and the 8-bit code point as the same as in Unicode, so that's actually ok.)

Conclusion: Your input is actually ISO-8859-1. That's why it accidentally worked for you without the /u. Change that, and be eaxact with input charsets.

like image 72
mario Avatar answered Oct 01 '22 20:10

mario