
Preg_Replace and UTF8

Tags: regex, php, utf-8

I'm enhancing our video search page to highlight the search term(s) in the results. Because a user can enter judas priest while a video has Judas Priest in its text, I have to use regular expressions to preserve the case of the original text.

My code works, but I have problems with special characters like š, č and ž. It seems that Preg_Replace() will only match them if the case is the same (despite the /iu modifiers). My code:

$Content = Preg_Replace ( '/\b(' . $term . '?)\b/iu', '<span class="HighlightTerm">$1</span>', $Content );

I also tried this:

$Content = Mb_Eregi_Replace ( '\b(' . $term . '?)\b', '<span class="HighlightTerm">\\1</span>', $Content );

But it also doesn't work. It will match "SREČA" if the search term is "SREČA", but if the search term is "sreča" it will not match it (and vice versa).

So how do I make this work?

Update: I set the locale and internal encoding:

Mb_Internal_Encoding ( 'UTF-8' );
$loc = "UTF-8";
putenv("LANG=$loc");
$loc = setlocale(LC_ALL, $loc);
Jan Hančič asked Jan 14 '10 09:01




1 Answer

I feel really stupid right about now, but the problem wasn't with the Preg_* functions at all. For some reason I first checked whether the given term was even in the string with StriPos. That function is not multi-byte safe, so it returned false whenever a non-ASCII character in the text differed in case from the search term, and Preg_Replace was never even called.
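To make the difference concrete, here is a minimal sketch comparing the two pre-checks. The sample sentence and variable names are made up for illustration and are not from the original code. It assumes the mbstring extension is loaded and that the source file is saved as UTF-8.

```php
<?php
// Hypothetical sample data, chosen only to show the case mismatch.
$content = 'Danes je SREČA na naši strani.';
$term    = 'sreča';

// stripos() folds case byte by byte. "Č" and "č" are two-byte UTF-8
// sequences that differ in their second byte, so they never compare equal.
$byteWise = stripos($content, $term) !== false;

// mb_stripos() folds case per character in the given encoding.
$multiByte = mb_stripos($content, $term, 0, 'UTF-8') !== false;

var_dump($byteWise);  // expected: bool(false)
var_dump($multiByte); // expected: bool(true)
```

Passing the encoding explicitly to mb_stripos() avoids relying on whatever Mb_Internal_Encoding() happens to be set to.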

So the lesson here: always use the multi-byte versions of string functions when you are working with UTF-8 strings.
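Putting that lesson together with the highlighting code from the question, one possible sketch looks like this. The function name highlightTerm() and its signature are hypothetical. Two changes from the original snippet are assumptions of this sketch: preg_quote() escapes the user's input, and the trailing `?` after the term is dropped.

```php
<?php
// Hypothetical helper; the name and signature are illustrative only.
function highlightTerm(string $content, string $term): string
{
    // Multi-byte safe pre-check, standing in for the stripos() call that
    // silently failed on "Č" vs "č".
    if ($term === '' || mb_stripos($content, $term, 0, 'UTF-8') === false) {
        return $content;
    }

    // preg_quote() escapes regex metacharacters in the search term
    // (an addition that is not in the original snippet).
    $pattern = '/\b(' . preg_quote($term, '/') . ')\b/iu';

    // preg_replace() returns null on error; fall back to the input then.
    return preg_replace($pattern, '<span class="HighlightTerm">$1</span>', $content)
        ?? $content;
}

echo highlightTerm('Danes je SREČA na naši strani.', 'sreča'), PHP_EOL;
echo highlightTerm('Judas Priest live', 'judas priest'), PHP_EOL;
```

The pre-check is kept only to mirror the original flow. preg_replace() with the /iu modifiers already returns the text unchanged when nothing matches, so the check could be dropped.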

Jan Hančič answered Sep 22 '22 20:09