Match alpha letters and accented alpha letters

Question

I'm looking for some regex code with this pattern:

The string must contain at least one character, and the whole string must match.
It can contain only alpha letters (a-z, A-Z)
and accented alpha letters (á, ä, à, etc.).

I'm using preg_match('/^([\p{L}]*)$/iu', $input), but \p{L} matches all Unicode letters, including Chinese. I only want to allow the letters of the English alphabet plus their accented variants.

So JohnDoe, Fübar, Lòrem, FírstNäme, Çákë would all be valid inputs, because they all contain at least 1 alpha letter and/or accented alpha letters, and the whole string matches.

zx81 · Accepted Answer

I would suggest this compact regex:

^(?i)(?:(?![×Þß÷þø])[a-zÀ-ÿ])+$

See demo.

This regex takes advantage of the fact that the accented letters you want all seem to live in the Unicode range from À (U+00C0) to ÿ (U+00FF), so we simply add that range to the character class.
The À-ÿ range contains a few unwanted characters. Unlike some engines, PCRE (PHP's regex engine) does not support character class subtraction, but we can mimic it with the negative lookahead (?![×Þß÷þø]).
The ^ and $ anchors make the pattern match the whole string, as the question asks, rather than any run of letters inside it.
Be aware that some characters such as à can be expressed by more than one sequence of Unicode code points: the single precomposed code point à (U+00E0), or a plain a followed by a combining grave accent (U+0300). This regex only matches the precomposed forms. Catching all variations is really hard.

In your code:

$regex = "~^(?i)(?:(?![×Þß÷þø])[a-zÀ-ÿ])+$~u";
$hit = preg_match($regex, $subject, $match);
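
As a minimal sketch of how this pattern could be wrapped into a validator: the helper below (the name isLatinName is hypothetical, not from the answer) anchors the pattern with ^ and \z so that the whole string has to match, and, only if the intl extension happens to be loaded, first composes combining sequences into single code points with Normalizer::normalize. Both the anchoring style and the normalization step are assumptions layered on top of the answer's character class. It also assumes PHP 7 or later and a source file saved as UTF-8.

```php
<?php
// Hypothetical helper; the name and the NFC step are illustrative assumptions.
function isLatinName(string $input): bool
{
    // Optional: compose "a" + U+0300 into the single code point "à" (needs ext-intl).
    if (class_exists('Normalizer')) {
        $composed = Normalizer::normalize($input, Normalizer::FORM_C);
        if ($composed !== false) {
            $input = $composed;
        }
    }

    // ^ and \z anchor the match to the whole string; i = caseless, u = UTF-8 mode.
    $regex = '~^(?:(?![×Þß÷þø])[a-zÀ-ÿ])+\z~iu';

    return preg_match($regex, $input) === 1;
}

// Sample inputs taken from the question and the other answer, plus the empty string.
foreach (['JohnDoe', 'Fübar', 'Çákë', 'J0hnDoe', 'F行bar', ''] as $word) {
    printf("%s => %s\n", $word, isLatinName($word) ? 'valid' : 'invalid');
}
```

With this sketch, the first three sample words should be reported as valid and the last three as invalid. Using \z instead of $ is a deliberate choice here: $ would also accept a string that ends in a trailing newline.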

Andrew Mackrodt · Answer

I came up with the following solution using a combination of preg_match and iconv. Tested with PHP 5.5 on Windows and Linux:

$testWords = array(
    // pass
    'Çákë',
    'JohnDoe',
    'Fübar',
    'Lòrem',
    'FírstNäme',
    // fail
    'Ç@kë',
    'J0hnDoe',
    'F行bar',
    'L高rem',
    'F前rstNäme',
    'Ç学kë',
    '0'
);

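$matchedWords = array_filter($testWords, function ($word) {
    // these characters should not be in the search string but may appear after iconv conversion
    $regexCharsNot = '\^~"`\'';

    // first check: reject raw input that already contains one of those characters
    if (preg_match("/[$regexCharsNot]/u", $word)) {
        return false;
    }

    // transliterate to ASCII; iconv returns false on failure (the notice is suppressed with @)
    $ascii = @iconv('UTF-8', 'ASCII//TRANSLIT', $word);
    if ($ascii === false) {
        return false;
    }

    // second check: only ASCII letters and the transliteration marks may remain
    return preg_match("/^[A-Za-z$regexCharsNot]+$/u", $ascii) === 1;
});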

print_r($matchedWords);

/*
Array
(
    [0] => Çákë
    [1] => JohnDoe
    [2] => Fübar
    [3] => Lòrem
    [4] => FírstNäme
)
 */

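iconv with ASCII//TRANSLIT can introduce extraneous characters, such as a quote or caret standing in for a diacritic, which is why the double validation with $regexCharsNot is required. The exact output depends on the iconv implementation and the locale, so the list may differ on other systems. I came up with that list using the following: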

// mb_str_split regex           http://www.php.net/manual/en/function.mb-split.php#99851
// list of accented characters  http://fasforward.com/list-of-european-special-characters/

$accentedCharacters = preg_split(
    '/(?<!^)(?!$)/u',
    'ÄäÀàÁáÂâÃãÅåĄąĂăÆæÇçĆćĈĉČčĎđĐďðÈèÉéÊêËëĚěĘęĜĝĢģĤĥÌìÍíÎîÏïĴĵĶķĹĺĻļŁłĽľÑñŃńŇňÖöÒòÓóÔôÕõŐőØøŒœŔŕŘřßŚśŜŝŞşŠšŤťŢţÞþÜüÙùÚúÛûŰűŨũŲųŮůŴŵÝýŸÿŶŷŹźŽžŻż');

/*
$unsupported = ''; // 'Ǎǎẞ';

foreach ($accentedCharacters as $c) {
    if (!@iconv('UTF-8', 'ASCII//TRANSLIT', $c)) {
        $unsupported .= $c;
    }
}
*/
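
To see why both checks in the filter are needed, here is a small sketch that leaves iconv out entirely, since its output varies between systems. It assumes, purely for illustration, that a user typed F"ubar directly; the two closure names are hypothetical stand-ins for the two preg_match calls in the filter above.

```php
<?php
// Same character list as in the answer: ASCII marks that transliteration may emit.
$regexCharsNot = '\^~"`\'';

// Hypothetical stand-in for the second check (normally run on the transliterated string).
$looksAlphaAfterTranslit = function (string $ascii) use ($regexCharsNot): bool {
    return preg_match("/^[A-Za-z$regexCharsNot]+\$/", $ascii) === 1;
};

// Hypothetical stand-in for the first check (run on the raw input).
$containsReservedMark = function (string $raw) use ($regexCharsNot): bool {
    return preg_match("/[$regexCharsNot]/u", $raw) === 1;
};

$raw = 'F"ubar'; // typed by the user, NOT produced by transliteration

// The second check alone would accept it, because it has to tolerate the marks...
var_dump($looksAlphaAfterTranslit($raw));

// ...so the first check on the raw input is what rejects it.
var_dump($containsReservedMark($raw));
```

Both calls should report true for this input, which is the point: the permissive second pattern cannot tell a typed quote from one emitted by transliteration, so the raw input has to be screened first.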

