Sanitizing data means removing any illegal characters from the data. Sanitizing user input is one of the most common tasks in a web application. To make this task easier, PHP provides a native filter extension that you can use to sanitize data such as e-mail addresses, URLs, IP addresses, etc.
We can sanitize a URL by using FILTER_SANITIZE_URL. This filter removes all characters except letters, digits and $-_.+!*'(),{}|\\^~[]`<>#%";/?:@&=.
The FILTER_SANITIZE_EMAIL filter removes all illegal characters from an email address.
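As a quick illustration of the two filters above, here is a minimal sketch using filter_var(). The input strings are made up for the example. Note that sanitizing only removes characters, so the cleaned values are validated in a separate step.

```php
<?php
// Hypothetical raw inputs, made up for this illustration.
$rawUrl   = "https://www.exa mple.com/pa th?q=1";
$rawEmail = "(bogus@example.org)";

// FILTER_SANITIZE_URL drops every character outside its allow-list
// (here: the two spaces).
$cleanUrl = filter_var($rawUrl, FILTER_SANITIZE_URL);

// FILTER_SANITIZE_EMAIL drops characters that are not legal in an
// e-mail address (here: the parentheses).
$cleanEmail = filter_var($rawEmail, FILTER_SANITIZE_EMAIL);

// Sanitizing is not validating: check the cleaned values separately.
$urlIsValid   = filter_var($cleanUrl, FILTER_VALIDATE_URL) !== false;
$emailIsValid = filter_var($cleanEmail, FILTER_VALIDATE_EMAIL) !== false;

var_dump($cleanUrl);   // string(32) "https://www.example.com/path?q=1"
var_dump($cleanEmail); // string(17) "bogus@example.org"
```

A sanitized value can still be semantically wrong (for example a syntactically valid URL pointing at the wrong host), which is why the validation step is kept separate.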
According to PHP Manual:
Strip tags, optionally strip or encode special characters.
According to W3Schools:
The FILTER_SANITIZE_STRING filter strips or encodes unwanted characters. This filter removes data that is potentially harmful for your application. It is used to strip tags and remove or encode unwanted characters.
Now, that doesn't tell us much. Let's go see some PHP sources.
ext/filter/filter.c:
static const filter_list_entry filter_list[] = {
    /*...*/
    { "string", FILTER_SANITIZE_STRING, php_filter_string },
    { "stripped", FILTER_SANITIZE_STRING, php_filter_string },
    { "encoded", FILTER_SANITIZE_ENCODED, php_filter_encoded },
    /*...*/
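The table above registers "string" and "stripped" as two names for the same filter. A small sketch to confirm that from userland with filter_id(), assuming a PHP version that still ships the string filter (it is deprecated as of PHP 8.1); no deprecated constant is referenced here, only the filter names:

```php
<?php
// Look the filters up by the names registered in filter_list[].
$stringId   = filter_id('string');
$strippedId = filter_id('stripped');
$encodedId  = filter_id('encoded');

// Both names resolve to the same filter ID.
var_dump($stringId === $strippedId);

// A registered name resolves to the value of its constant.
var_dump($encodedId === FILTER_SANITIZE_ENCODED);
```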
Now, let's go see how php_filter_string is defined.
ext/filter/sanitizing_filters.c:
/* {{{ php_filter_string */
void php_filter_string(PHP_INPUT_FILTER_PARAM_DECL)
{
    size_t new_len;
    unsigned char enc[256] = {0};
    /* strip high/strip low ( see flags )*/
    php_filter_strip(value, flags);
    if (!(flags & FILTER_FLAG_NO_ENCODE_QUOTES)) {
        enc['\''] = enc['"'] = 1;
    }
    if (flags & FILTER_FLAG_ENCODE_AMP) {
        enc['&'] = 1;
    }
    if (flags & FILTER_FLAG_ENCODE_LOW) {
        memset(enc, 1, 32);
    }
    if (flags & FILTER_FLAG_ENCODE_HIGH) {
        memset(enc + 127, 1, sizeof(enc) - 127);
    }
    php_filter_encode_html(value, enc);
    /* strip tags, implicitly also removes \0 chars */
    new_len = php_strip_tags_ex(Z_STRVAL_P(value), Z_STRLEN_P(value), NULL, NULL, 0, 1);
    Z_STRLEN_P(value) = new_len;
    if (new_len == 0) {
        zval_dtor(value);
        if (flags & FILTER_FLAG_EMPTY_STRING_NULL) {
            ZVAL_NULL(value);
        } else {
            ZVAL_EMPTY_STRING(value);
        }
        return;
    }
}
I'll skip commenting on the flags since they're already explained on the Internet, like you said, and focus on what is always performed instead, which is not so well documented.
First - php_filter_strip. It doesn't do much, just takes the flags you pass to the function and processes them accordingly. It does the well-documented stuff.
Then we construct some kind of map and call php_filter_encode_html. It's more interesting: it converts stuff like ", ', & and chars with ASCII codes lower than 32 and higher than 127 to numeric HTML entities, so & in your string becomes &#38;. Again, it uses flags for this.
Then we get a call to php_strip_tags_ex, which just strips HTML, XML and PHP tags (according to its definition in /ext/standard/string.c) and removes NULL bytes, like the comment says.
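To make the order of operations concrete, here is a rough userland approximation of the steps described so far: encode first, strip tags second. This is only a sketch under assumptions, not the extension's implementation. The function name sanitize_string_sketch and its parameters are hypothetical, the strip/encode low and high flags are left out, and the userland strip_tags() may differ from the internal php_strip_tags_ex call in edge cases.

```php
<?php
// Hypothetical helper: approximates the default encode-then-strip order.
function sanitize_string_sketch(
    string $value,
    bool $encodeQuotes = true,  // mirrors the absence of FILTER_FLAG_NO_ENCODE_QUOTES
    bool $encodeAmp = false     // mirrors FILTER_FLAG_ENCODE_AMP
): string {
    // Build the character => numeric entity map, like the enc[] table.
    $map = [];
    if ($encodeQuotes) {
        $map["'"] = '&#39;';
        $map['"'] = '&#34;';
    }
    if ($encodeAmp) {
        $map['&'] = '&#38;';
    }

    // strtr() replaces in a single pass, so the "&" inside an entity
    // that was just produced is not encoded a second time.
    $encoded = $map ? strtr($value, $map) : $value;

    // Tags are removed after encoding; the text between them is kept.
    return strip_tags($encoded);
}

echo sanitize_string_sketch('<b>Hello!</b> "hi" & \'yo\''), "\n";
// Hello! &#34;hi&#34; & &#39;yo&#39;
```

Because the quotes are encoded before the tags are stripped, attribute values inside a tag never reach the output; only the content between tags survives.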
The code that follows it is used for internal string management and doesn't really do any sanitization. Well, not exactly - passing the undocumented flag FILTER_FLAG_EMPTY_STRING_NULL will return NULL if the sanitized string is empty, instead of returning just an empty string, but it's not really that useful. An example:
var_dump(filter_var("yo", FILTER_SANITIZE_STRING, FILTER_FLAG_EMPTY_STRING_NULL));
var_dump(filter_var("\0", FILTER_SANITIZE_STRING, FILTER_FLAG_EMPTY_STRING_NULL));
var_dump(filter_var("yo", FILTER_SANITIZE_STRING));
var_dump(filter_var("\0", FILTER_SANITIZE_STRING));
→
string(2) "yo"
NULL
string(2) "yo"
string(0) ""
There isn't much more going on, so the manual was fairly correct - to sum it up:
FILTER_FLAG_NO_ENCODE_QUOTES - This flag does not encode quotes.
FILTER_FLAG_STRIP_LOW - Strip characters with ASCII value below 32.
FILTER_FLAG_STRIP_HIGH - Strip characters with ASCII value above 127.
FILTER_FLAG_ENCODE_LOW - Encode characters with ASCII value below 32.
FILTER_FLAG_ENCODE_HIGH - Encode characters with ASCII value above 127.
FILTER_FLAG_ENCODE_AMP - Encode the & character to &#38; (not &amp;).
FILTER_FLAG_EMPTY_STRING_NULL - Return NULL instead of empty strings.
I wasn't sure if "stripping tags" means just the < > characters, and if it preserves content between tags, e.g. the string "Hello!" from <b>Hello!</b>, so I decided to check. Here are the results, using PHP 7.1.5 (and Bash for the command line):
curl --data-urlencode 'my-input='\
'1. ASCII b/n 32 and 127: ABC abc 012 '\
'2. ASCII higher than 127: Çüé '\
'3. PHP tag: <?php $i = 0; ?> '\
'4. HTML tag: <script type="text/javascript">var i = 0;</script> '\
'5. Ampersand: & '\
'6. Backtick: ` '\
'7. Double quote: " '\
'8. Single quote: '"'" \
http://localhost/sanitize.php
<?php echo filter_input(INPUT_POST,'my-input', FILTER_SANITIZE_STRING);
1. ASCII b/n 32 and 127: ABC abc 012 2. ASCII higher than 127: Çüé 3. PHP tag: 4. HTML tag: var i = 0; 5. Ampersand: & 6. Backtick: ` 7. Double quote: &#34; 8. Single quote: &#39;
<?php echo filter_input(INPUT_POST,'my-input', FILTER_SANITIZE_STRING, FILTER_FLAG_NO_ENCODE_QUOTES);
1. ASCII b/n 32 and 127: ABC abc 012 2. ASCII higher than 127: Çüé 3. PHP tag: 4. HTML tag: var i = 0; 5. Ampersand: & 6. Backtick: ` 7. Double quote: " 8. Single quote: '
<?php echo filter_input(INPUT_POST,'my-input', FILTER_SANITIZE_STRING, FILTER_FLAG_STRIP_HIGH);
1. ASCII b/n 32 and 127: ABC abc 012 2. ASCII higher than 127: 3. PHP tag: 4. HTML tag: var i = 0; 5. Ampersand: & 6. Backtick: ` 7. Double quote: &#34; 8. Single quote: &#39;
<?php echo filter_input(INPUT_POST,'my-input', FILTER_SANITIZE_STRING, FILTER_FLAG_STRIP_BACKTICK);
1. ASCII b/n 32 and 127: ABC abc 012 2. ASCII higher than 127: Çüé 3. PHP tag: 4. HTML tag: var i = 0; 5. Ampersand: & 6. Backtick: 7. Double quote: &#34; 8. Single quote: &#39;
<?php echo filter_input(INPUT_POST,'my-input', FILTER_SANITIZE_STRING, FILTER_FLAG_ENCODE_HIGH);
1. ASCII b/n 32 and 127: ABC abc 012 2. ASCII higher than 127: &#195;&#135;&#195;&#188;&#195;&#169; 3. PHP tag: 4. HTML tag: var i = 0; 5. Ampersand: & 6. Backtick: ` 7. Double quote: &#34; 8. Single quote: &#39;
<?php echo filter_input(INPUT_POST,'my-input', FILTER_SANITIZE_STRING, FILTER_FLAG_ENCODE_AMP);
1. ASCII b/n 32 and 127: ABC abc 012 2. ASCII higher than 127: Çüé 3. PHP tag: 4. HTML tag: var i = 0; 5. Ampersand: &#38; 6. Backtick: ` 7. Double quote: &#34; 8. Single quote: &#39;
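One thing worth spelling out about FILTER_FLAG_ENCODE_HIGH: the encoding works on bytes, not characters, so a multibyte UTF-8 character turns into one numeric entity per byte. A minimal sketch of that behaviour follows. The helper name encode_high_bytes is hypothetical, and the threshold of 127 is taken from the memset(enc + 127, ...) call in the source quoted earlier.

```php
<?php
// Hypothetical helper: encodes every byte with a value of 127 or above
// as a decimal numeric entity, one entity per byte.
function encode_high_bytes(string $value): string
{
    $out = '';
    foreach (str_split($value) as $byte) {
        $code = ord($byte);
        $out .= $code >= 127 ? "&#{$code};" : $byte;
    }
    return $out;
}

// "Ç" is the two bytes 0xC3 0x87 in UTF-8, written out explicitly here
// so the result does not depend on the encoding of this file.
echo encode_high_bytes("\xC3\x87"), "\n"; // &#195;&#135;
```

A browser reads those as two separate code points rather than as "Ç", so the flag is mostly useful for plain single-byte data.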
Also, for the flags FILTER_FLAG_STRIP_LOW & FILTER_FLAG_ENCODE_LOW, since my Bash doesn't display these characters, I checked using the bell character (ASCII 007) and the Restman Chrome extension that:
