
Match alpha letters and accented alpha letters

Tags: regex, php, unicode

I'm looking for some regex code with this pattern:

  • Must contain at least 1 of the following and match the whole string.

  • Can contain only alpha letters (a-z A-Z) ...

  • and accented alpha letters (á ä à etc).

I'm using preg_match('/^([\p{L}]*)$/iu', $input), but \p{L} matches all Unicode letters, including Chinese characters. I want to allow only the letters of the English alphabet and their accented variants.

So JohnDoe, Fübar, Lòrem, FírstNäme, Çákë would all be valid inputs, because they all contain at least 1 alpha letter and/or accented alpha letters, and the whole string matches.

asked Jun 22 '14 22:06 by 502 Error


2 Answers

I would suggest this compact regex:

(?i)(?:(?![×Þß÷þø])[a-zÀ-ÿ])+

See demo.

  1. This regex takes advantage of the fact that the accented letters you want all seem to live in the Unicode range from À (U+00C0) to ÿ (U+00FF) (see this table), so we simply add that range to the character class.
  2. The À-ÿ range contains a few unwanted characters. Unlike some engines, PCRE (PHP's regex engine) does not support character class subtraction, but we can mimic it with the negative lookahead (?![×Þß÷þø]).
  3. Be aware that some characters, such as à, can be encoded in more than one way: as a single precomposed code point (à, U+00E0) or as a base letter a followed by a combining grave accent (U+0300). This regex only matches the precomposed forms. Catching all variations is really hard.
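
Point 3 can be observed directly. The following is a minimal sketch, assuming PHP 7 or later (for the "\u{...}" string escapes), a source file saved as UTF-8, and a version of the pattern anchored with ^ and $. The variable names and the optional call to Normalizer::normalize (from the intl extension, which may not be installed) are additions for illustration and not part of the answer itself.

```php
<?php
// The pattern from the answer, anchored so the whole string must match.
$pattern = '~^(?i)(?:(?![×Þß÷þø])[a-zÀ-ÿ])+$~u';

$precomposed = "\u{00E0}";   // "à" as a single code point
$decomposed  = "a\u{0300}";  // "a" followed by a combining grave accent

var_dump(preg_match($pattern, $precomposed)); // the precomposed form matches
var_dump(preg_match($pattern, $decomposed));  // the combining mark is outside the class

// Optional: if the intl extension is available, normalizing to NFC first
// folds the decomposed sequence into the precomposed code point.
if (class_exists('Normalizer')) {
    $normalized = Normalizer::normalize($decomposed, Normalizer::FORM_C);
    var_dump(preg_match($pattern, $normalized));
}
```

Normalizing input to NFC before validation is one way to narrow the gap, though it only helps for sequences that have a precomposed equivalent.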

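In your code, with ^ and $ added so that the whole string must match:

// The u modifier makes PCRE treat the pattern and the subject as UTF-8.
$regex = '~^(?i)(?:(?![×Þß÷þø])[a-zÀ-ÿ])+$~u';
$hit = preg_match($regex, $subject, $match);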
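
As a rough usage sketch, the pattern can be wrapped in a small validation function and run against the inputs from the question. The function name isAccentedAlpha is hypothetical, and the sketch assumes the source file is saved as UTF-8. It uses \A and \z instead of ^ and $ because $ also matches just before a trailing newline, which may or may not be acceptable for your input.

```php
<?php
// Hypothetical helper: true only when the entire string consists of
// a-z, A-Z, or the Latin-1 accented letters accepted by the pattern.
function isAccentedAlpha(string $input): bool
{
    // \A and \z anchor to the absolute start and end of the subject;
    // unlike $, \z does not tolerate a trailing newline.
    // preg_match returns false on invalid UTF-8, which also yields false here.
    return preg_match('~\A(?i)(?:(?![×Þß÷þø])[a-zÀ-ÿ])+\z~u', $input) === 1;
}

$samples = ['JohnDoe', 'Fübar', 'Lòrem', 'FírstNäme', 'Çákë', 'J0hnDoe', 'F行bar', ''];
foreach ($samples as $sample) {
    echo $sample, ' => ', isAccentedAlpha($sample) ? 'valid' : 'invalid', "\n";
}
```

The first five samples are the valid inputs from the question; the digit, the Chinese character, and the empty string are rejected.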
answered Oct 13 '22 23:10 by zx81


I came up with the following solution using a combination of preg_match and iconv. It was tested with PHP 5.5 on Windows and Linux:

$testWords = array(
    // pass
    'Çákë',
    'JohnDoe',
    'Fübar',
    'Lòrem',
    'FírstNäme',
    // fail
    'Ç@kë',
    'J0hnDoe',
    'F行bar',
    'L高rem',
    'F前rstNäme',
    'Ç学kë',
    '0'
);

$matchedWords = array_filter($testWords, function ($word) {
    // these characters should not be in the search string but may appear after iconv conversion
    $regexCharsNot = '\^~"`\'';

    $valid = false;

    if (!preg_match("/[$regexCharsNot]/u", $word)) {
        if ($word = @iconv('UTF-8', 'ASCII//TRANSLIT', $word)) {
            $valid = preg_match("/^[A-Za-z$regexCharsNot]+$/u", $word);
        }
    }

    return $valid;
});

print_r($matchedWords);

/*
Array
(
    [0] => Çákë
    [1] => JohnDoe
    [2] => Fübar
    [3] => Lòrem
    [4] => FírstNäme
)
 */
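
The reason for checking twice in the callback above may be easier to see in isolation. The sketch below restates the same two-step logic with the transliterator passed in as a callable, so the behavior can be examined without depending on a particular iconv implementation. The names TRANSLIT_ARTIFACTS, isValidName, and $fakeTranslit are hypothetical, and $fakeTranslit is only a stand-in that imitates an implementation rendering ü as "u; real iconv output varies by platform.

```php
<?php
// Characters that transliteration may introduce and that raw input must not contain.
const TRANSLIT_ARTIFACTS = '\^~"`\'';

// Hypothetical helper: $translit is any callable that maps a UTF-8 string
// to ASCII, or returns false on failure (e.g. a wrapper around iconv()).
function isValidName(string $word, callable $translit): bool
{
    // Step 1: reject raw input that already contains an artifact character,
    // otherwise it would slip through the relaxed check in step 2.
    if (preg_match('/[' . TRANSLIT_ARTIFACTS . ']/u', $word)) {
        return false;
    }

    $ascii = $translit($word);
    if ($ascii === false || $ascii === '') {
        return false;
    }

    // Step 2: after transliteration only ASCII letters and artifacts may remain.
    return preg_match('/^[A-Za-z' . TRANSLIT_ARTIFACTS . ']+$/', $ascii) === 1;
}

// Stand-in transliterator for illustration only; not real iconv behavior.
$fakeTranslit = function (string $s) {
    return strtr($s, ['ü' => '"u', 'ò' => '`o']);
};

var_dump(isValidName('Fübar', $fakeTranslit));  // accepted: becomes F"ubar, allowed in step 2
var_dump(isValidName('F"ubar', $fakeTranslit)); // rejected in step 1
var_dump(isValidName('F行bar', $fakeTranslit)); // rejected in step 2
```

To use it with iconv as in the answer, the callable could be function ($s) { return @iconv('UTF-8', 'ASCII//TRANSLIT', $s); }.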

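iconv with ASCII//TRANSLIT can introduce extraneous characters: depending on the iconv implementation, an accented letter may come back as its base letter plus a quote, backtick, caret, or tilde. That is why the double validation with $regexCharsNot is required. I came up with that list using the following: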

// mb_str_split regex           http://www.php.net/manual/en/function.mb-split.php#99851
// list of accented characters  http://fasforward.com/list-of-european-special-characters/

$accentedCharacters = preg_split(
    '/(?<!^)(?!$)/u',
    'ÄäÀàÁáÂâÃãÅåĄąĂăÆæÇçĆćĈĉČčĎđĐďðÈèÉéÊêËëĚěĘęĜĝĢģĤĥÌìÍíÎîÏïĴĵĶķĹĺĻļŁłĽľÑñŃńŇňÖöÒòÓóÔôÕõŐőØøŒœŔŕŘřßŚśŜŝŞşŠšŤťŢţÞþÜüÙùÚúÛûŰűŨũŲųŮůŴŵÝýŸÿŶŷŹźŽžŻż');

/*
$unsupported = ''; // 'Ǎǎẞ';

foreach ($accentedCharacters as $c) {
    if (!@iconv('UTF-8', 'ASCII//TRANSLIT', $c)) {
        $unsupported .= $c;
    }
}
*/
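
For reference, the splitting step and the commented-out loop can be sketched as two small functions. The names splitChars and unsupportedChars are hypothetical. The sketch splits with the empty pattern and PREG_SPLIT_NO_EMPTY, which should give the same result as the lookaround pattern above. Which characters end up reported as unsupported depends on the iconv implementation, so no expected output is shown for that part.

```php
<?php
// Split a UTF-8 string into characters: with the u flag the empty pattern
// matches between code points, and PREG_SPLIT_NO_EMPTY drops the empty ends.
function splitChars(string $s): array
{
    return preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
}

// Collect the characters for which iconv returns a falsy result on this
// platform, mirroring the check in the commented-out loop.
function unsupportedChars(string $chars): string
{
    $unsupported = '';
    foreach (splitChars($chars) as $c) {
        if (!@iconv('UTF-8', 'ASCII//TRANSLIT', $c)) {
            $unsupported .= $c;
        }
    }
    return $unsupported;
}

print_r(splitChars('ÄäÇç'));
echo unsupportedChars('ÄäÇç'), "\n"; // result depends on the iconv implementation
```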
answered Oct 13 '22 21:10 by Andrew Mackrodt