I'm looking for some regex code with this pattern:
Must contain at least 1 of the following and match the whole string.
Can contain only alpha letters (a-z A-Z) ...
and accented alpha letters (á ä à etc).
I'm using preg_match('/^([\p{L}]*)$/iu', $input), but \p{L} matches all Unicode letters, including Chinese. I just want to allow the English alphabet letters but also the accented variants of them.
So JohnDoe, Fübar, Lòrem, FírstNäme, Çákë would all be valid inputs, because they all contain at least 1 alpha letter and/or accented alpha letter, and the whole string matches.
I would suggest this compact regex:
(?i)(?:(?![×Þß÷þø])[a-zÀ-ÿ])+
See demo.
The accented letters you want all seem to fall in the range À to ÿ (see this table), so we simply add that range to the character class.
The range À-ÿ contains a few unwanted characters. Unlike some engines, PCRE (PHP's regex engine) does not support character class subtraction, but we can mimic it with the negative lookahead (?![×Þß÷þø]).
Be aware that a character such as à can be expressed in more than one way in Unicode (as the single precomposed à code point, or as an a followed by a combining grave accent). This regex only matches the precomposed forms. Catching all variations is really hard.
In your code:
// ^ and $ anchor the pattern so the whole string must match, as the question requires
$regex = '~^(?i)(?:(?![×Þß÷þø])[a-zÀ-ÿ])+$~u';
$hit = preg_match($regex, $subject, $match);
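To make the point about combined and non-combined forms concrete, here is a minimal sketch. It assumes PHP 7 or later (for the "\u{...}" escapes and type declarations) and a UTF-8 encoded source file. The helper name isAccentedLatin is hypothetical and not part of the answer above, and the pattern is anchored with ^ and $ so that the whole string has to match. The Normalizer step is optional and only runs when the intl extension is loaded.

```php
<?php
// Character-class pattern from the answer, anchored so the whole string must match.
$regex = '~^(?i)(?:(?![×Þß÷þø])[a-zÀ-ÿ])+$~u';

// Hypothetical helper: true only when every character of $s is allowed.
function isAccentedLatin(string $s, string $regex): bool
{
    return preg_match($regex, $s) === 1;
}

$precomposed = "\u{00E0}";   // à as a single code point
$decomposed  = "a\u{0300}";  // a followed by a combining grave accent

var_dump(isAccentedLatin('FírstNäme', $regex));   // bool(true)
var_dump(isAccentedLatin('F行bar', $regex));       // bool(false)
var_dump(isAccentedLatin($precomposed, $regex));  // bool(true)
var_dump(isAccentedLatin($decomposed, $regex));   // bool(false): U+0300 is outside À-ÿ

// Optional: compose to NFC first, if the intl extension is available.
if (class_exists('Normalizer')) {
    $nfc = Normalizer::normalize($decomposed, Normalizer::FORM_C);
    var_dump(isAccentedLatin($nfc, $regex));      // bool(true)
}
```

Normalizing to NFC before matching covers the common case where a base letter plus one combining mark has a precomposed equivalent. It does not help for sequences that have no precomposed form.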
I came up with the following solution using a combination of preg_match and iconv. Tested with PHP 5.5 on Windows and Linux:
$testWords = array(
    // pass
    'Çákë',
    'JohnDoe',
    'Fübar',
    'Lòrem',
    'FírstNäme',
    // fail
    'Ç@kë',
    'J0hnDoe',
    'F行bar',
    'L高rem',
    'F前rstNäme',
    'Ç学kë',
    '0'
);
$matchedWords = array_filter($testWords, function ($word) {
    // these characters should not be in the search string but may appear after iconv conversion
    $regexCharsNot = '\^~"`\'';
    $valid = false;
    if (!preg_match("/[$regexCharsNot]/u", $word)) {
        if ($word = @iconv('UTF-8', 'ASCII//TRANSLIT', $word)) {
            $valid = preg_match("/^[A-Za-z$regexCharsNot]+$/u", $word);
        }
    }
    return $valid;
});
echo print_r($matchedWords, true);
/*
Array
(
[0] => Çákë
[1] => JohnDoe
[2] => Fübar
[3] => Lòrem
[4] => FírstNäme
)
*/
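Using iconv with ASCII//TRANSLIT can introduce extraneous characters (such as the quote, caret, tilde, and backtick characters listed in $regexCharsNot), which is why the double validation with $regexCharsNot is required. I came up with that list using the following: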
// mb_str_split regex http://www.php.net/manual/en/function.mb-split.php#99851
// list of accented characters http://fasforward.com/list-of-european-special-characters/
$accentedCharacters = preg_split(
    '/(?<!^)(?!$)/u',
    'ÄäÀàÁáÂâÃãÅåĄąĂăÆæÇçĆćĈĉČčĎđĐďðÈèÉéÊêËëĚěĘęĜĝĢģĤĥÌìÍíÎîÏïĴĵĶķĹĺĻļŁłĽľÑñŃńŇňÖöÒòÓóÔôÕõŐőØøŒœŔŕŘřßŚśŜŝŞşŠšŤťŢţÞþÜüÙùÚúÛûŰűŨũŲųŮůŴŵÝýŸÿŶŷŹźŽžŻż');
/*
$unsupported = ''; // 'Ǎǎẞ';
foreach ($accentedCharacters as $c) {
    if (!@iconv('UTF-8', 'ASCII//TRANSLIT', $c)) {
        $unsupported .= $c;
    }
}
*/
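To see which extra characters the transliteration adds on a given system, a small inspection script along these lines can help. This is only a sketch: the helper name transliterate is hypothetical, and the output differs between iconv implementations and locale settings, so no expected output is shown.

```php
<?php
// Hypothetical helper: transliterate a UTF-8 string to ASCII.
// Returns false when iconv cannot convert the input.
function transliterate($s)
{
    return @iconv('UTF-8', 'ASCII//TRANSLIT', $s);
}

// A few accented letters to inspect; extend the list as needed.
$samples = array('ä', 'ò', 'Ç', 'ñ');

foreach ($samples as $c) {
    $ascii = transliterate($c);
    if ($ascii === false) {
        echo $c, " => (not convertible)\n";
    } else {
        // Any character printed here besides A-Z and a-z is a candidate
        // for the $regexCharsNot list.
        echo $c, ' => ', $ascii, "\n";
    }
}
```

Running this on each target platform shows whether the punctuation produced there is already covered by $regexCharsNot.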