I've found this regex in a script I'm customizing. Can someone tell me what it's doing?
function test($text) {
    $regex = '/( [\x00-\x7F] | [\xC0-\xDF][\x80-\xBF] | [\xE0-\xEF][\x80-\xBF]{2} | [\xF0-\xF7][\x80-\xBF]{3} ) | ./x';
    return preg_replace($regex, '$1', $text);
}
Inside of the capturing group there are four options:
[\x00-\x7F]
[\xC0-\xDF][\x80-\xBF]
[\xE0-\xEF][\x80-\xBF]{2}
[\xF0-\xF7][\x80-\xBF]{3}
If none of these patterns matches at a given position, then the . outside of the capturing group will match any single character instead.
The preg_replace call will iterate over $text, finding all non-overlapping matches and replacing each match with whatever was captured. There are two possibilities here: either the entire match was inside the capturing group, so the replacement doesn't change $text, or the . at the end matched a single character and that character is removed from $text.
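To make the two cases concrete, here is a small sketch that runs the same pattern on a few hand-made byte strings. The function name strip_invalid_utf8 and the sample inputs are made up for illustration; only the pattern and the preg_replace call come from the question.

```php
<?php
// Same pattern and replacement as in the question; the name is hypothetical.
function strip_invalid_utf8(string $text): string {
    $regex = '/( [\x00-\x7F] | [\xC0-\xDF][\x80-\xBF] | [\xE0-\xEF][\x80-\xBF]{2} | [\xF0-\xF7][\x80-\xBF]{3} ) | ./x';
    return preg_replace($regex, '$1', $text);
}

// "é" encoded as the two bytes C3 A9: matched inside the group, so it is kept.
$valid = "caf\xC3\xA9";
// A lone FF byte matches none of the four options, so the dot consumes it.
$stray = "ab\xFFcd";
// C3 must be followed by a byte in 80-BF, but "x" (78) is not one.
$truncated = "a\xC3x";

var_dump(strip_invalid_utf8($valid) === $valid);   // bool(true)
var_dump(strip_invalid_utf8($stray) === "abcd");   // bool(true)
var_dump(strip_invalid_utf8($truncated) === "ax"); // bool(true)
```

Note that the pattern has no u modifier, so PCRE works on single bytes here, and "character" in the explanation above means one byte.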
Here are some basic examples:
If a character in \xF8-\xFF appears in the text, it will always be removed.
A character in \xC0-\xDF will be removed unless followed by a character in \x80-\xBF.
A character in \xE0-\xEF will be removed unless followed by two characters in \x80-\xBF.
A character in \xF0-\xF7 will be removed unless followed by three characters in \x80-\xBF.
A character in \x80-\xBF will be removed unless it was matched as a part of one of the above cases.
The purpose appears to be to "clean" UTF-8 encoded text. The part in the capturing group,
( [\x00-\x7F] | [\xC0-\xDF][\x80-\xBF] | [\xE0-\xEF][\x80-\xBF]{2} | [\xF0-\xF7][\x80-\xBF]{3} )
...roughly matches a valid UTF-8 byte sequence, which may be one to four bytes long. The value of the first byte determines how long that particular byte sequence should be.
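As a rough illustration of how the first byte fixes the sequence length, the helper below maps a lead byte to the number of bytes the pattern expects. The function name expected_sequence_length is hypothetical, and the ranges are copied from the pattern rather than from the UTF-8 specification. This is one reason the match is only approximate: strict UTF-8 (RFC 3629) never uses C0, C1, or F5 and above as lead bytes, yet the pattern accepts them.

```php
<?php
// Hypothetical helper: number of bytes the pattern expects for a sequence
// that starts with $leadByte (0-255), or 0 if the byte cannot start one.
function expected_sequence_length(int $leadByte): int {
    if ($leadByte >= 0x00 && $leadByte <= 0x7F) {
        return 1; // ASCII, a sequence of one byte
    }
    if ($leadByte >= 0xC0 && $leadByte <= 0xDF) {
        return 2; // lead byte plus one continuation byte
    }
    if ($leadByte >= 0xE0 && $leadByte <= 0xEF) {
        return 3; // lead byte plus two continuation bytes
    }
    if ($leadByte >= 0xF0 && $leadByte <= 0xF7) {
        return 4; // lead byte plus three continuation bytes
    }
    return 0; // 80-BF are continuation bytes; F8-FF match no option
}

var_dump(expected_sequence_length(ord("A")));    // int(1)
var_dump(expected_sequence_length(ord("\xE2"))); // int(3)
```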
Since the replacement is simply '$1', valid byte sequences will be plugged right back into the output. Any byte that's not matched by that part will instead be matched by the dot (.) and effectively removed.
The most important thing to know about this technique is that you should never have to use it. If you find invalid UTF-8 byte sequences in your UTF-8 encoded text, it means one of two things: it's not really UTF-8, or it's been corrupted. Instead of "cleaning" it, you should find out how it got dirty and fix that problem.
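If the goal is to find out where bad input comes from, one option is to detect invalid text and report it rather than silently strip bytes. The sketch below relies on the fact that preg_match with the u modifier returns false when the subject is not valid UTF-8; the function name is_valid_utf8 is made up for this example.

```php
<?php
// Hypothetical helper: true if $text is well-formed UTF-8.
// With the u modifier, PCRE validates the subject first and
// preg_match returns false (not 0 or 1) when validation fails.
function is_valid_utf8(string $text): bool {
    return preg_match('//u', $text) === 1;
}

var_dump(is_valid_utf8("caf\xC3\xA9")); // bool(true)  "é" as UTF-8
var_dump(is_valid_utf8("caf\xE9"));     // bool(false) "é" as Latin-1
```

A false result for text that looks fine in another encoding, as in the second call, usually points to the "not really UTF-8" case; mb_check_encoding from the mbstring extension can be used for the same check where that extension is installed.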