
PHP preg_match is mismatching a curly apostrophe with other types of curly quotes. How to avoid?

Tags: regex, php, unicode

I have the following variable content:

$content_content = '“I can’t do it, she said.”';

I want to match every "word" in that string, including contractions, so I use preg_match_all as follows:

 if (preg_match_all('/([a-zA-Z0-9’]+)/', $content_content, $matches))
 {
    echo '<pre>';
    print_r($matches);
    echo '</pre>';
 }

However, including ’ in the character class seems to make it also catch the curly double quotes, as the code above outputs:

Array
(
    [0] => Array
        (
            [0] => ��
            [1] => I
            [2] => can’t
            [3] => do
            [4] => it
            [5] => she
            [6] => said
            [7] => ��
        )

    [1] => Array
        (
            [0] => ��
            [1] => I
            [2] => can’t
            [3] => do
            [4] => it
            [5] => she
            [6] => said
            [7] => ��
        )

)

How can I include ’ without it also including the “ and ”?

jaydisc asked Feb 11 '23 10:02


1 Answer

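This happens because, without the u modifier, PCRE treats both the pattern and the subject as sequences of single bytes. The "fancy" apostrophe ’ (U+2019) is the three bytes E2 80 99 in UTF-8, so the character class ends up containing those three bytes individually. The curly double quotes “ (E2 80 9C) and ” (E2 80 9D) begin with the same two bytes, which therefore match and are captured as a broken two-byte fragment. You need to enable Unicode mode with the u modifier so that the class matches whole code points instead: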

preg_match_all('/([a-zA-Z0-9’]+)/u', $content_content, $matches)
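
To see the difference for yourself, here is a minimal sketch that compares the two modes. It assumes the script file is saved as UTF-8; the variable names $byteMode and $unicodeMode are only illustrative and are not from the question.

```php
<?php
// Assumption: this file is saved as UTF-8, so the literals below are UTF-8 bytes.
$content = '“I can’t do it, she said.”';

// The three characters share their first two bytes (E2 80).
echo bin2hex('’'), "\n"; // apostrophe U+2019
echo bin2hex('“'), "\n"; // opening double quote U+201C
echo bin2hex('”'), "\n"; // closing double quote U+201D

// Without /u the class is a set of bytes: a-z, A-Z, 0-9, 0xE2, 0x80, 0x99.
preg_match_all('/[a-zA-Z0-9’]+/', $content, $byteMode);

// With /u the class is a set of code points, so only U+2019 itself matches.
preg_match_all('/[a-zA-Z0-9’]+/u', $content, $unicodeMode);

// First "word" in byte mode is the first two bytes of the opening quote.
echo bin2hex($byteMode[0][0]), "\n";

// In Unicode mode only the real words remain.
print_r($unicodeMode[0]);
```

In byte mode the first and last matches are the two-byte fragments that show up as �� in the question's output; with /u they disappear and can’t is still matched as one word.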


Ja͢ck answered Feb 14 '23 00:02