What's this regex doing?

Tags: regex, php

I found this regex in a script I'm customizing. Can someone tell me what it's doing?

function test( $text) {
    $regex = '/( [\x00-\x7F] | [\xC0-\xDF][\x80-\xBF] | [\xE0-\xEF][\x80-\xBF]{2} | [\xF0-\xF7][\x80-\xBF]{3} ) | ./x';
    return preg_replace($regex, '$1', $text);
}
Scott B asked Oct 10 '22 05:10


2 Answers

Inside the capturing group there are four alternatives (the x modifier makes the regex ignore the spaces between them):

  1. [\x00-\x7F]
  2. [\xC0-\xDF][\x80-\xBF]
  3. [\xE0-\xEF][\x80-\xBF]{2}
  4. [\xF0-\xF7][\x80-\xBF]{3}

If none of these alternatives matches at a given position, the . outside the capturing group matches a single byte instead. The pattern has no u modifier, so it works on bytes rather than on decoded characters.

The preg_replace call iterates over $text, finds all non-overlapping matches, and replaces each match with whatever group 1 captured.

There are two possibilities. Either the entire match was inside the capturing group, so the replacement leaves $text unchanged, or the . at the end matched a single byte, group 1 is empty, and that byte is removed from $text.

Here are some basic examples:

  • A byte in the range \xF8-\xFF is always removed.
  • A byte in \xC0-\xDF is removed unless it is followed by a byte in \x80-\xBF.
  • A byte in \xE0-\xEF is removed unless it is followed by two bytes in \x80-\xBF.
  • A byte in \xF0-\xF7 is removed unless it is followed by three bytes in \x80-\xBF.
  • A byte in \x80-\xBF is removed unless it was matched as part of one of the cases above.
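
The cases above can be checked with a few small inputs. This is only a sketch: strip_invalid_utf8 is a hypothetical name for the function from the question, and the sample strings are made up for illustration.

```php
<?php
// Same regex as in the question; only the function name is changed (hypothetical).
function strip_invalid_utf8(string $text): string {
    $regex = '/( [\x00-\x7F] | [\xC0-\xDF][\x80-\xBF] | [\xE0-\xEF][\x80-\xBF]{2} | [\xF0-\xF7][\x80-\xBF]{3} ) | ./x';
    return preg_replace($regex, '$1', $text);
}

// "\xC3\xA9" is a lead byte in \xC0-\xDF plus one continuation byte: kept as is.
var_dump(strip_invalid_utf8("caf\xC3\xA9") === "caf\xC3\xA9");

// \xFF can never start or continue a sequence: removed.
var_dump(strip_invalid_utf8("a\xFFb") === "ab");

// \xC3 followed by an ASCII byte instead of a continuation byte: \xC3 is removed, "x" stays.
var_dump(strip_invalid_utf8("\xC3x") === "x");

// A stray continuation byte with no lead byte in front of it: removed.
var_dump(strip_invalid_utf8("a\x80b") === "ab");

// \xE2 needs two continuation bytes but has only one: \xE2 is removed first,
// which leaves \x82 as a stray continuation byte, so it is removed too.
var_dump(strip_invalid_utf8("\xE2\x82!") === "!");
```

Each var_dump call should report true if the behavior described in the list holds.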
Andrew Clark answered Oct 12 '22 23:10


The purpose appears to be to "clean" UTF-8 encoded text. The part in the capturing group,

( [\x00-\x7F] | [\xC0-\xDF][\x80-\xBF] | [\xE0-\xEF][\x80-\xBF]{2} | [\xF0-\xF7][\x80-\xBF]{3} )

...roughly matches a valid UTF-8 byte sequence, which may be one to four bytes long. The value of the first byte determines how long that particular byte sequence should be.
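
As a rough illustration of how the first byte fixes the length, the helper below maps a lead byte to the sequence length the regex expects. The function name is hypothetical and the ranges are simply copied from the pattern. It also shows why the match is only "rough": the pattern accepts lead bytes such as \xC0, \xC1, and \xF5-\xF7, which strict UTF-8 does not allow, and it does not reject overlong encodings.

```php
<?php
// Hypothetical helper: sequence length implied by a lead byte, using the same
// ranges as the regex. Assumes $byte is a non-empty string; only its first
// byte is inspected. Returns 0 for bytes that cannot start a sequence.
function expected_sequence_length(string $byte): int {
    $b = ord($byte[0]);
    if ($b <= 0x7F) {
        return 1; // plain ASCII
    }
    if ($b >= 0xC0 && $b <= 0xDF) {
        return 2; // lead byte plus one continuation byte
    }
    if ($b >= 0xE0 && $b <= 0xEF) {
        return 3; // lead byte plus two continuation bytes
    }
    if ($b >= 0xF0 && $b <= 0xF7) {
        return 4; // lead byte plus three continuation bytes
    }
    return 0; // \x80-\xBF (continuation bytes) and \xF8-\xFF
}

var_dump(expected_sequence_length("\xE2")); // lead byte of a three-byte sequence
```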

Since the replacement is simply '$1', valid byte sequences are plugged right back into the output. Any byte that's not matched by that part is instead matched by the dot (.) and effectively removed.

The most important thing to know about this technique is that you should never have to use it. If you find invalid UTF-8 byte sequences in your UTF-8 encoded text, it means one of two things: it's not really UTF-8, or it's been corrupted. Instead of "cleaning" it, you should find out how it got dirty and fix that problem.
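
One way to look for the source of the problem is to detect bad input without altering it. The sketch below is an assumption about how that could be done, not part of the original script: is_valid_utf8 is a hypothetical helper that relies on preg_match returning false when the u modifier is used on a malformed UTF-8 subject.

```php
<?php
// Hypothetical helper: reports whether $text is well-formed UTF-8 without changing it.
function is_valid_utf8(string $text): bool {
    // With the u modifier, PCRE validates the subject as UTF-8 before matching.
    // The empty pattern matches any valid string (result 1); malformed input
    // makes preg_match() return false.
    return preg_match('//u', $text) === 1;
}

// Sample input (made up): "café" encoded as ISO-8859-1, which is not valid UTF-8.
$input = "caf\xE9";

if (!is_valid_utf8($input)) {
    // Record the raw bytes so the origin of the bad data can be tracked down,
    // rather than silently stripping them.
    error_log('Invalid UTF-8 received: ' . bin2hex($input));
}
```

If the input turns out to be in a different encoding, converting it at the point where it enters the application is likely a better fix than stripping bytes afterwards.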

Alan Moore answered Oct 12 '22 23:10