Obviously $data is the string and we are removing the characters that match the regular expression, but which characters are specified by /[\xF0-\xF7].../ ?
preg_replace('/[\xF0-\xF7].../', '', $data)
Also, what is the significance of these characters being replaced?
Edit for bounty: specifically, what exploit is this trying to prevent? The data is later used in MySQL queries (non-PDO), so I presume some kind of injection attack involving these characters, or perhaps not? I am trying to understand the logic behind this line of code in a script I am reading.
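It removes 4-byte sequences from a UTF-8 string. In such a sequence the first byte is always in the range [\xF0-\xF7], and the three dots match the remaining 3 bytes. Because the pattern has no u modifier, each dot matches a single byte rather than a whole character.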
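As a quick illustration of the byte-level matching described above, here is a minimal sketch. The wrapper name strip_four_byte_sequences and the sample string are made up for this example; only the pattern itself comes from the question.

```php
<?php
// Hypothetical wrapper around the exact pattern from the question.
function strip_four_byte_sequences(string $data): string
{
    // No "u" modifier, so the pattern works on raw bytes:
    // one leading byte in F0..F7 followed by any three bytes.
    return (string) preg_replace('/[\xF0-\xF7].../', '', $data);
}

// "\u{E9}" (e with acute) encodes to 2 bytes, "\u{1F600}" (an emoji) to 4 bytes.
$input = "caf\u{E9} \u{1F600} ok";
$clean = strip_four_byte_sequences($input);

echo bin2hex("\u{1F600}"), "\n";        // f09f9880 (leading byte F0)
echo strlen($input), " -> ", strlen($clean), "\n"; // 13 -> 9
```

Only the 4-byte emoji is removed; the 2-byte accented letter and the ASCII text are left alone.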
According to http://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8mb4.html:
The character set named utf8 uses a maximum of three bytes per character and contains only BMP characters.
MySQL with the utf8 character set selected may truncate the text at the point where such a sequence appears, and if the SQL mode does not include STRICT_TRANS_TABLES it may do so silently instead of throwing an error like SQLSTATE[HY000]: General error: 1366 Incorrect string value.
Truncation can potentially lead to an exploit.
For example, suppose a website has a user named admin and allows anyone to register. Using a name that gets truncated, an attacker may be able to insert a second admin with a different email address, bypassing the uniqueness check. The attacker then suspends that account and tries the account-restore procedure, which issues a query like SELECT * FROM users WHERE name = 'admin'. Since the original admin is the first record returned, the restore is applied to the original account and the attacker gains control of its password.
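The collision can be simulated without a database. This is only a sketch under stated assumptions: truncate_like_legacy_utf8 is a hypothetical stand-in for the non-strict truncation described above (it is not a MySQL or PHP API), the uniqueness check is assumed to happen in application code, and the user list and submitted name are invented for the example.

```php
<?php
// Hypothetical stand-in for what a non-strict 3-byte utf8 column may store:
// everything before the first 4-byte UTF-8 sequence.
function truncate_like_legacy_utf8(string $value): string
{
    $parts = preg_split('/[\xF0-\xF7]/', $value, 2);
    return $parts[0];
}

$existingUsers = ['admin'];          // assumed existing account
$submitted     = "admin\u{1F600}x";  // attacker-chosen name with a 4-byte character

// The application-level check sees the full, untruncated value.
$passesUniqueCheck = !in_array($submitted, $existingUsers, true);

// The stored value is what later queries will match on.
$stored = truncate_like_legacy_utf8($submitted);

// Stripping 4-byte sequences first means the check and the storage agree.
$sanitized       = (string) preg_replace('/[\xF0-\xF7].../', '', $submitted);
$storedSanitized = truncate_like_legacy_utf8($sanitized);

var_dump($passesUniqueCheck); // bool(true)
var_dump($stored);            // string(5) "admin"
var_dump($storedSanitized);   // string(6) "adminx"
```

With the sanitizing step in place, the value that is checked is the same value that is stored, so a second "admin" row cannot slip past the check this way.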
It matches one of 8 byte values plus the 3 bytes that follow, and removes the whole block of 4 bytes. That much you say you know already. Unfortunately, without more context, we can't tell you why these particular 8 byte values are significant. By themselves they're harmless, regardless of which glyph they stand for in a given character encoding. My best guess is that in the application this comes from, these 8 values have some significance as markers of some kind. 0xF0 through 0xF7 have the bit pattern 11110xxx, the leading byte of a 4-byte (32-bit) UTF-8 sequence, so perhaps the intent is to remove all 4-byte UTF-8 characters? Are 2-byte and 3-byte characters (leading bytes 110xxxxx and 1110xxxx) treated similarly?
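To make the bit patterns concrete, here is a small sketch that maps a leading byte to the sequence length it announces. The function name utf8_sequence_length is made up for this example, and it checks the bit pattern only; it does not validate that the byte is legal in well-formed UTF-8.

```php
<?php
// Hypothetical helper: sequence length implied by a UTF-8 leading byte,
// or 0 for a continuation byte (10xxxxxx) or any other pattern.
function utf8_sequence_length(int $byte): int
{
    if (($byte & 0x80) === 0x00) return 1; // 0xxxxxxx (ASCII)
    if (($byte & 0xE0) === 0xC0) return 2; // 110xxxxx
    if (($byte & 0xF0) === 0xE0) return 3; // 1110xxxx
    if (($byte & 0xF8) === 0xF0) return 4; // 11110xxx, i.e. 0xF0..0xF7
    return 0;
}

// Leading bytes of a 1-, 2-, 3- and 4-byte character.
$samples = ['A', "\u{E9}", "\u{20AC}", "\u{1F600}"];
$lengths = array_map(fn($c) => utf8_sequence_length(ord($c[0])), $samples);

echo implode(',', $lengths), "\n"; // 1,2,3,4
```

The range [\xF0-\xF7] in the question's pattern is exactly the set of bytes for which this helper returns 4.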