Match only characters from the same language (like a Facebook name)?

preg_match(???, 'firstname lastname'); // true
preg_match(???, '서프 누워'); // true
preg_match(???, '서프 lastname'); // false
preg_match(???, '#$@ #$$#'); // false

Currently I use:

'/^([一-龠0-9\s]+|[ぁ-ゔ0-9\s]+|[ก-๙0-9\s]+|[ァ-ヴー0-9\s]+|[a-zA-Z0-9\s]+|[々〆〤0-9\s]+)$/u'

But it only works for some languages.

newz asked Sep 28 '14 23:09


1 Answer

You need an expression that matches only characters from a single Unicode script (plus spaces), like:

 ^([\p{SomeScript} ]+|[\p{SomeOtherScript} ]+|...)$

You can build this expression dynamically from the list of scripts:

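$scripts = ['Hangul', 'Hiragana', 'Han', 'Latin', 'Cyrillic']; // feel free to add more

// One alternative per script: a run of that script's characters and spaces.
$parts = [];
foreach ($scripts as $s) {
    $parts[] = sprintf('[\p{%s} ]+', $s);
}
// The u modifier makes PCRE treat both pattern and subject as UTF-8.
$re = '~^(' . implode('|', $parts) . ')$~u';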

print preg_match($re, 'firstname lastname'); // 1
print preg_match($re, '서프 누워'); // 1
print preg_match($re, '서프 lastname'); // 0
print preg_match($re, '#$@ #$$#'); // 0

Note, however, that names (at least in the European scripts I'm familiar with) commonly include characters such as dots, dashes and apostrophes, which belong to the "Common" script rather than to a language-specific one. To take these into account, a more realistic "chunk" in the above expression could look like this:

 ((\p{SomeScript}+(\. ?|[ '-]))*\p{SomeScript}+)

which will at least correctly validate L. A. Léon de Saint-Just.
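
To tie the two ideas together, here is a minimal sketch that builds one such "chunk" per script and joins them into a single pattern. The function name buildNameRegex and the script list passed to it are made up for illustration; only the chunk shape comes from the expression above. The é is written as an escape so that the precomposed character (U+00E9) is used.

```php
<?php
// Hypothetical helper, not part of the answer above: builds one
// alternative per script, each shaped like the "chunk" expression.
function buildNameRegex(array $scripts): string
{
    // Allowed separators between runs of letters:
    // a dot with an optional space, a space, an apostrophe or a dash.
    $sep = '(?:\. ?|[ \'-])';

    $chunks = [];
    foreach ($scripts as $s) {
        $letters = sprintf('\p{%s}+', $s);
        // (letters separator)* letters, so a name must end in letters
        $chunks[] = '(?:' . $letters . $sep . ')*' . $letters;
    }
    return '~^(?:' . implode('|', $chunks) . ')$~u';
}

// Assumed script list, for illustration only.
$re = buildNameRegex(['Hangul', 'Hiragana', 'Han', 'Latin', 'Cyrillic']);

print preg_match($re, "L. A. L\u{E9}on de Saint-Just"); // 1
print preg_match($re, '서프 누워');                      // 1
print preg_match($re, '서프 lastname');                  // 0
print preg_match($re, '#$@ #$$#');                       // 0
```

Unlike the simpler `[\p{Script} ]+` form, this version also rejects input made up of separators alone, because every alternative has to end with at least one letter of its script.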

In general, validating people's names is a complicated problem and cannot be solved with 100% accuracy. See this funny post and comments therein for details and examples.

georg answered Oct 19 '22 10:10