
Regex to match string with and without special/accented characters?

Tags: regex, php

Is there a regular expression that matches a specific string both with and without special (accented) characters? Accent-insensitive matching, so to speak.

For example, céra should match cera, and vice versa.

Any ideas?

Edit: I want to match specific strings with and without special/accented characters. Not just any string/character.

Test example:

$clientName   = 'céra';
$this->search = 'cera';

$compareClientName = strtolower(iconv('utf-8', 'ascii//TRANSLIT', $clientName));
$this->search      = strtolower($this->search);

if (strpos($compareClientName, $this->search) !== false)
{
    $clientName = preg_replace('/(.*?)('.$this->search.')(.*?)/iu', '$1<span class="highlight">$2</span>$3', $clientName);
}

Output: <span class="highlight">céra</span>

As you can see, I want to highlight the specific search string. However, I still want to display the original (accented) characters of the matched string.

I'll have to combine this with Michael Sivolobov's answer somehow, I guess.

I think I'll have to work with a separate preg_match() and preg_replace(), right?

jlmmns asked Sep 26 '13 08:09




2 Answers

As you noted in one of the comments, you don't need a regular expression for this, since the goal is to find specific strings. Why not use explode? Like this:

$clientName   = 'céra';
$this->search = 'cera';

$compareClientName = strtolower(iconv('utf-8', 'ascii//TRANSLIT', $clientName));
$this->search      = strtolower($this->search);

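// explode() takes the separator first: split the transliterated name on the search term.
$pieces = explode($this->search, $compareClientName);

if (count($pieces) > 1)
{
    // Re-join the pieces with the search term wrapped in a highlight span.
    // Note: the result is built from the transliterated (unaccented) name.
    $clientName = implode('<span class="highlight">'.$this->search.'</span>', $pieces);
}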

Edit:

If your $search variable may contain special characters too, why don't you transliterate it as well and use mb_strpos with an $offset? Like this:

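// Transliterate the search term too, so accented input still matches.
$needle = strtolower(iconv('utf-8', 'ascii//TRANSLIT', $this->search));

$offset = 0;
$highlighted = '';
$len = mb_strlen($needle, 'UTF-8');
// mb_strpos() returns false (not -1) when there is no further match.
while ($len > 0 && ($pos = mb_strpos($compareClientName, $needle, $offset, 'UTF-8')) !== false) {
    // Cut from the original $clientName so the accents are kept. This assumes
    // transliteration maps each character to exactly one character.
    $highlighted .= mb_substr($clientName, $offset, $pos-$offset, 'UTF-8').
         '<span class="highlight">'.
         mb_substr($clientName, $pos, $len, 'UTF-8').'</span>';
    $offset = $pos + $len;
}
// Append the rest; the third argument is the length, so pass null before the encoding.
$highlighted .= mb_substr($clientName, $offset, null, 'UTF-8');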
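
For reference, the loop above can be wrapped in a function. This is only a sketch: foldAccents() and highlightIgnoringAccents() are hypothetical names, and the tiny hand-written map stands in for iconv, whose //TRANSLIT output varies between platforms. It assumes the source file is saved as UTF-8 and that folding and lowercasing keep one character per character, which holds for the letters in the map but not for every script.

```php
<?php
// Hypothetical helper: fold a few accented letters to ASCII, one character in, one character out.
function foldAccents(string $s): string
{
    // Illustrative map only; extend it for the letters your data contains.
    $map = ['é' => 'e', 'è' => 'e', 'ê' => 'e', 'à' => 'a', 'ç' => 'c',
            'É' => 'E', 'È' => 'E', 'Ê' => 'E', 'À' => 'A', 'Ç' => 'C'];
    return strtr($s, $map);
}

// Hypothetical wrapper: highlight every accent-insensitive match of $search in $text.
function highlightIgnoringAccents(string $text, string $search): string
{
    $foldedText   = mb_strtolower(foldAccents($text), 'UTF-8');
    $foldedSearch = mb_strtolower(foldAccents($search), 'UTF-8');
    $len          = mb_strlen($foldedSearch, 'UTF-8');
    if ($len === 0) {
        return $text; // nothing to search for
    }

    $offset = 0;
    $out    = '';
    // Positions found in the folded text are reused on the original text,
    // which is valid only because folding preserves the character count.
    while (($pos = mb_strpos($foldedText, $foldedSearch, $offset, 'UTF-8')) !== false) {
        $out .= mb_substr($text, $offset, $pos - $offset, 'UTF-8')
              . '<span class="highlight">'
              . mb_substr($text, $pos, $len, 'UTF-8')
              . '</span>';
        $offset = $pos + $len;
    }
    // Append whatever follows the last match.
    return $out . mb_substr($text, $offset, null, 'UTF-8');
}

echo highlightIgnoringAccents('Céra and cera', 'cera'), "\n";
```

The highlighted spans contain the original text, so "Céra" keeps its accent and capital letter even though the match was found in the folded copy.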

Update 2:

It is important to use the mb_ functions instead of plain strlen etc., because in UTF-8 accented characters are stored using two or more bytes. Also, always make sure that you use the right encoding. Take a look at this, for example:

echo strlen('é');
> 2

echo mb_strlen('é');
> 2

echo mb_internal_encoding();
> ISO-8859-1

echo mb_strlen('é', 'UTF-8');
> 1

mb_internal_encoding('UTF-8');
echo mb_strlen('é');
> 1
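
The same byte-versus-character difference affects offsets, which matters when a position found in one string is used to cut another. A small sketch (the variable names are only for illustration, and 'é' is written as its two UTF-8 bytes so the result does not depend on how the file is saved):

```php
<?php
$name = "c\xC3\xA9ra"; // 'céra', with 'é' spelled out as its two UTF-8 bytes

// strpos() counts bytes, so everything after 'é' is shifted by one.
$byteOffset = strpos($name, 'r');                // 3
// mb_strpos() with an explicit encoding counts characters.
$charOffset = mb_strpos($name, 'r', 0, 'UTF-8'); // 2

// Byte-based substr() can cut 'é' in half and leave invalid UTF-8 behind.
$broken  = substr($name, 0, 2);             // "c" plus the first byte of 'é'
$correct = mb_substr($name, 0, 2, 'UTF-8'); // "cé"

var_dump($byteOffset, $charOffset, mb_check_encoding($broken, 'UTF-8'), $correct);
```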
Adam Zielinski answered Oct 12 '22 09:10


You can use the \p{L} pattern to match any letter.

Source

You have to add the u modifier after the regular expression to enable Unicode (UTF-8) mode.

Example: /\p{L}+/u
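
As a quick illustration of what that pattern does (the sample text is made up, and the file is assumed to be saved as UTF-8), it picks out runs of letters whether or not they carry accents. Note that it matches any word, not one specific string:

```php
<?php
$text = 'Céra, cera & Søren 42';

// \p{L}+ matches runs of letters in any script; the u modifier makes PCRE
// treat both the pattern and the subject as UTF-8 instead of raw bytes.
preg_match_all('/\p{L}+/u', $text, $matches);

print_r($matches[0]); // the three words: Céra, cera, Søren
```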

Edit:

Try something like this. It turns every accented letter of the name into a sub-pattern that accepts the accented letter (as a single precomposed character), the unaccented letter, and the unaccented letter followed by a Unicode combining mark. You can then use the resulting search pattern to highlight your text.

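function mbStringToArray($string)
{
    $array = array(); // start empty so an empty string returns an empty array
    $strlen = mb_strlen($string, "UTF-8");
    while($strlen)
    {
        // Take the first character, then drop it from the string.
        $array[] = mb_substr($string, 0, 1, "UTF-8");
        $string = mb_substr($string, 1, $strlen, "UTF-8");
        $strlen = mb_strlen($string, "UTF-8");
    }
    return $array;
}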

// I had to use this ugly function to remove accents as iconv didn't work properly on my test server.
// Note: utf8_encode() and utf8_decode() are deprecated as of PHP 8.2.
function stripAccents($stripAccents){
    return utf8_encode(strtr(utf8_decode($stripAccents),utf8_decode('àáâãäçèéêëìíîïñòóôõöùúûüýÿÀÁÂÃÄÇÈÉÊËÌÍÎÏÑÒÓÔÕÖÙÚÛÜÝ'),'aaaaaceeeeiiiinooooouuuuyyAAAAACEEEEIIIINOOOOOUUUUY'));
}

$clientName = 'céra';

$clientNameNoAccent = stripAccents($clientName);

$clientNameArray = mbStringToArray($clientName);

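foreach($clientNameArray as $pos => &$char)
{
    // Byte indexing lines up with character positions only if every
    // accented letter in the name is covered by stripAccents().
    $charNA = $clientNameNoAccent[$pos];
    if($char != $charNA)
    {
        // Accept the accented letter, the bare letter, or the bare letter plus a combining mark.
        $char = "(?:$char|$charNA|$charNA\p{M})";
    }
}
unset($char); // break the reference left over from the foreach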

$clientSearchPattern = implode($clientNameArray); // c(?:é|e|e\p{M})ra

$text = 'the client name is Céra but it could be Cera or céra too.';

$search = preg_replace('/(.*?)(' . $clientSearchPattern . ')(.*?)/iu', '$1<span class="highlight">$2</span>$3', $text);

echo $search; // the client name is <span class="highlight">Céra</span> but it could be <span class="highlight">Cera</span> or <span class="highlight">céra</span> too.
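
To see why the pattern has three alternatives, it can be checked against the different ways 'é' may be encoded. The following is a sketch, assuming PHP 7.0 or later for the "\u{...}" string escapes. The pattern is the one from the comment above, with é written as \x{E9} so it does not depend on the file's encoding, and the variable names are only for illustration:

```php
<?php
// Same pattern as c(?:é|e|e\p{M})ra, with é spelled as the code point U+00E9.
$pattern = '/c(?:\x{E9}|e|e\p{M})ra/iu';

$precomposed = "c\u{00E9}ra";  // 'é' as one code point (U+00E9)
$decomposed  = "ce\u{0301}ra"; // 'e' followed by COMBINING ACUTE ACCENT (U+0301)
$plain       = 'cera';         // no accent at all

$results = [];
foreach ([$precomposed, $decomposed, $plain] as $candidate) {
    // preg_match() returns 1 when the pattern matches.
    $results[] = preg_match($pattern, $candidate);
}

var_dump($results); // each entry is 1
```

Without the e\p{M} alternative, the decomposed form would not match, because after the bare e the pattern would expect r and find the combining mark instead.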
Kethryweryn answered Oct 12 '22 10:10