I've found this regex in a script I'm customizing. Can someone tell me what it's doing?
function test( $text) {
    $regex = '/( [\x00-\x7F] | [\xC0-\xDF][\x80-\xBF] | [\xE0-\xEF][\x80-\xBF]{2} | [\xF0-\xF7][\x80-\xBF]{3} ) | ./x';
    return preg_replace($regex, '$1', $text);
}
Inside of the capturing group there are four options:
[\x00-\x7F]
[\xC0-\xDF][\x80-\xBF]
[\xE0-\xEF][\x80-\xBF]{2}
[\xF0-\xF7][\x80-\xBF]{3}
If none of these patterns are matched at a given location, then any character will be matched by the . that is outside of the capturing group.
The preg_replace call will iterate over $text finding all non-overlapping matches, replacing each match with whatever was captured.
There are two possibilities here: either the entire match was inside the capturing group, so the replacement doesn't change $text, or the . at the end matched a single character (a single byte, since the pattern has no u modifier) and that character is removed from $text.
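Both cases can be seen on a few hand-made byte strings. This is only a minimal sketch: the pattern is the one from the question, while the function name strip_unmatched_bytes and the sample inputs are made up for illustration. The output is printed as hex so the individual bytes are visible.

```php
<?php
// Same pattern and replacement as in the question; only the function name is changed.
function strip_unmatched_bytes(string $text): string {
    $regex = '/( [\x00-\x7F] | [\xC0-\xDF][\x80-\xBF] | [\xE0-\xEF][\x80-\xBF]{2} | [\xF0-\xF7][\x80-\xBF]{3} ) | ./x';
    return preg_replace($regex, '$1', $text);
}

// C3 A9 is a lead byte plus one continuation byte: the whole match is captured,
// so the replacement puts it straight back.
echo bin2hex(strip_unmatched_bytes("caf\xC3\xA9")), "\n"; // 636166c3a9

// A lone FF fits none of the four alternatives; the dot consumes it and $1 is empty.
echo bin2hex(strip_unmatched_bytes("a\xFFb")), "\n"; // 6162

// E2 needs two continuation bytes but only one follows: E2 is dropped,
// then the stray 82 is dropped on the next match.
echo bin2hex(strip_unmatched_bytes("\xE2\x82!")), "\n"; // 21
```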
Here are some basic examples:
\xF8-\xFF will always be removed
\xC0-\xDF will be removed unless followed by a character in \x80-\xBF
\xE0-\xEF will be removed unless followed by two characters in \x80-\xBF
\xF0-\xF7 will be removed unless followed by three characters in \x80-\xBF
\x80-\xBF will be removed unless it was matched as a part of one of the above cases
The purpose appears to be to "clean" UTF-8 encoded text. The part in the capturing group,
( [\x00-\x7F] | [\xC0-\xDF][\x80-\xBF] | [\xE0-\xEF][\x80-\xBF]{2} | [\xF0-\xF7][\x80-\xBF]{3} )
...roughly matches a valid UTF-8 byte sequence, which may be one to four bytes long. The value of the first byte determines how long that particular byte sequence should be.
Since the replacement is simply '$1', valid byte sequences will be plugged right back into the output. Any byte that's not matched by that part will instead be matched by the dot (.) and effectively removed.
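The word "roughly" matters: the group only checks the shape of a sequence (lead byte plus the right number of continuation bytes), so some byte sequences pass through untouched even though they are not valid UTF-8. The sketch below compares the group against PCRE's own UTF-8 validation, which is triggered by the u modifier. The function names passes_filter and is_strict_utf8 and the sample strings are assumptions made for this illustration.

```php
<?php
// The capturing group from the question, anchored so it must cover the whole input.
// Without the u modifier, PCRE treats the subject as raw bytes.
function passes_filter(string $bytes): bool {
    $group = '[\x00-\x7F] | [\xC0-\xDF][\x80-\xBF] | [\xE0-\xEF][\x80-\xBF]{2} | [\xF0-\xF7][\x80-\xBF]{3}';
    return preg_match('/\A(?: ' . $group . ' )*\z/x', $bytes) === 1;
}

// With the u modifier, the subject is validated as UTF-8 before matching,
// and preg_match() returns false when that validation fails.
function is_strict_utf8(string $bytes): bool {
    return preg_match('//u', $bytes) === 1;
}

$samples = [
    'euro sign'        => "\xE2\x82\xAC",     // U+20AC, valid
    'overlong NUL'     => "\xC0\x80",         // two-byte encoding of U+0000
    'UTF-16 surrogate' => "\xED\xA0\x80",     // U+D800, not allowed in UTF-8
    'beyond U+10FFFF'  => "\xF5\x80\x80\x80", // lead byte F5 is never valid
];

foreach ($samples as $label => $bytes) {
    printf("%-17s filter=%s strict=%s\n", $label,
        var_export(passes_filter($bytes), true),
        var_export(is_strict_utf8($bytes), true));
}
```

Only the first sample satisfies both checks; the other three pass the filter but fail strict validation, so text "cleaned" by the regex can still be rejected by a stricter consumer.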
The most important thing to know about this technique is that you should never have to use it. If you find invalid UTF-8 byte sequences in your UTF-8 encoded text, it means one of two things: it's not really UTF-8, or it's been corrupted. Instead of "cleaning" it, you should find out how it got dirty and fix that problem.
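One way to start finding out how the text got dirty is to report the offending bytes instead of deleting them. The following is a rough sketch under that assumption: find_dropped_bytes is a hypothetical helper, not part of the original script, and it reuses the question's pattern only to list which bytes the cleanup would have removed and where.

```php
<?php
// Hypothetical helper: returns [byte offset => hex value] for every byte the
// question's regex would silently drop. The input itself is left untouched.
function find_dropped_bytes(string $text): array {
    $regex = '/( [\x00-\x7F] | [\xC0-\xDF][\x80-\xBF] | [\xE0-\xEF][\x80-\xBF]{2} | [\xF0-\xF7][\x80-\xBF]{3} ) | ./x';
    $dropped = [];
    preg_match_all($regex, $text, $matches, PREG_SET_ORDER | PREG_OFFSET_CAPTURE);
    foreach ($matches as $match) {
        // Group 1 is absent (or reported at offset -1) when only the dot matched.
        if (!isset($match[1]) || $match[1][1] === -1) {
            [$byte, $offset] = $match[0];
            $dropped[$offset] = bin2hex($byte);
        }
    }
    return $dropped;
}

// "café au lait" saved as ISO-8859-1 instead of UTF-8: the é is the single byte E9.
print_r(find_dropped_bytes("caf\xE9 au lait"));
```

A lone E9 at offset 3 inside otherwise plain ASCII is a typical sign that the text is really ISO-8859-1 or Windows-1252, which points at a wrong encoding label upstream, not at data that needs stripping.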