
What does FILTER_SANITIZE_STRING do?

People also ask

What is the use of PHP sanitize function?

Sanitizing data means removing any illegal characters from it. Sanitizing user input is one of the most common tasks in a web application. To make this task easier, PHP provides a native filter extension that you can use to sanitize data such as e-mail addresses, URLs and IP addresses.

How do you sanitize a URL in PHP?

We can sanitize a URL by using the FILTER_SANITIZE_URL filter. It removes all characters except letters, digits and $-_.+!*'(),{}|\\^~[]`<>#%";/?:@&=.

What is FILTER_SANITIZE_EMAIL?

The FILTER_SANITIZE_EMAIL filter removes all illegal characters from an email address.


According to the PHP Manual:

Strip tags, optionally strip or encode special characters.

According to W3Schools:

The FILTER_SANITIZE_STRING filter strips or encodes unwanted characters.

This filter removes data that is potentially harmful for your application. It is used to strip tags and remove or encode unwanted characters.

Now, that doesn't tell us much. Let's look at the PHP source.

ext/filter/filter.c:

static const filter_list_entry filter_list[] = {                                       
    /*...*/
    { "string",          FILTER_SANITIZE_STRING,        php_filter_string          },  
    { "stripped",        FILTER_SANITIZE_STRING,        php_filter_string          },  
    { "encoded",         FILTER_SANITIZE_ENCODED,       php_filter_encoded         },  
    /*...*/

Now, let's see how php_filter_string is defined.

ext/filter/sanitizing_filters.c:

/* {{{ php_filter_string */
void php_filter_string(PHP_INPUT_FILTER_PARAM_DECL)
{
    size_t new_len;
    unsigned char enc[256] = {0};

    /* strip high/strip low ( see flags )*/
    php_filter_strip(value, flags);

    if (!(flags & FILTER_FLAG_NO_ENCODE_QUOTES)) {
        enc['\''] = enc['"'] = 1;
    }
    if (flags & FILTER_FLAG_ENCODE_AMP) {
        enc['&'] = 1;
    }
    if (flags & FILTER_FLAG_ENCODE_LOW) {
        memset(enc, 1, 32);
    }
    if (flags & FILTER_FLAG_ENCODE_HIGH) {
        memset(enc + 127, 1, sizeof(enc) - 127);
    }

    php_filter_encode_html(value, enc);

    /* strip tags, implicitly also removes \0 chars */
    new_len = php_strip_tags_ex(Z_STRVAL_P(value), Z_STRLEN_P(value), NULL, NULL, 0, 1);
    Z_STRLEN_P(value) = new_len;

    if (new_len == 0) {
        zval_dtor(value);
        if (flags & FILTER_FLAG_EMPTY_STRING_NULL) {
            ZVAL_NULL(value);
        } else {
            ZVAL_EMPTY_STRING(value);
        }
        return;
    }
}

I'll skip commenting on the flags since they're already explained on the Internet, like you said, and focus instead on what is always performed, which is not so well documented.

First - php_filter_strip. It doesn't do much: it takes the flags you pass to the filter and removes the matching characters (the "strip high/strip low" flags mentioned in the comment). This is the well-documented part.
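
To make the stripping step concrete, here is a minimal C sketch of a flag-driven, in-place byte filter. It is not the body of php_filter_strip (which isn't quoted here): the name sketch_strip and the SKETCH_* flag values are made up for illustration, and treating "high" as above 127 follows the summary in this answer rather than the real source.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flag values for this sketch only (not PHP's constants). */
enum {
    SKETCH_STRIP_LOW      = 1 << 0,
    SKETCH_STRIP_HIGH     = 1 << 1,
    SKETCH_STRIP_BACKTICK = 1 << 2
};

/* Removes the flagged bytes from s in place and returns the new length.
 * s must have room for a terminator at s[len], as any C string does. */
size_t sketch_strip(char *s, size_t len, unsigned flags)
{
    size_t out = 0;

    assert(s != NULL);
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)s[i];

        if ((flags & SKETCH_STRIP_LOW) && c < 32) continue;        /* control bytes */
        if ((flags & SKETCH_STRIP_HIGH) && c > 127) continue;      /* assumed boundary */
        if ((flags & SKETCH_STRIP_BACKTICK) && c == '`') continue; /* backtick */
        s[out++] = (char)c;                                        /* keep the byte */
    }
    s[out] = '\0';
    return out;
}
```

With no flags set, the input comes back unchanged, which matches the observation that stripping only happens when you ask for it.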

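Then an encoding map (enc) is built and passed to php_filter_encode_html. This is more interesting: every byte flagged in the map is replaced with a numeric character reference. By default only " and ' are flagged (they become &#34; and &#39;). The & character is flagged only with FILTER_FLAG_ENCODE_AMP (it becomes &#38;), and bytes below 32 or from 127 upward only with FILTER_FLAG_ENCODE_LOW or FILTER_FLAG_ENCODE_HIGH.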
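
The map-driven encoding can be sketched in a few lines of C. This is an illustration of the idea only, not php_filter_encode_html itself (whose body isn't quoted here); sketch_encode and its buffer contract are assumptions of mine.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Copies src to dst, replacing each byte flagged in enc[] with "&#N;".
 * dst must hold at least 6 * len + 1 bytes ("&#255;" is 6 characters).
 * Returns the number of bytes written, excluding the terminator. */
size_t sketch_encode(const char *src, size_t len,
                     const unsigned char enc[256], char *dst)
{
    size_t out = 0;

    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)src[i];

        if (enc[c]) {
            /* flagged byte: emit its decimal value as a character reference */
            out += (size_t)sprintf(dst + out, "&#%u;", (unsigned)c);
        } else {
            dst[out++] = (char)c;
        }
    }
    dst[out] = '\0';
    return out;
}
```

Filling the map the way php_filter_string does by default (enc['\''] = enc['"'] = 1) turns a & "b" into a & &#34;b&#34;, and additionally setting enc['&'] gives a &#38; &#34;b&#34;.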

Then comes the call to php_strip_tags_ex, which strips HTML, XML and PHP tags (according to its definition in /ext/standard/string.c) and removes NUL bytes, as the comment says.
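
To show what "strip tags, drop NUL bytes" means for the text between tags, here is a deliberately naive C sketch. It is not how php_strip_tags_ex works: the real function is a much larger state machine, while this hypothetical sketch_strip_tags simply drops everything from a < up to the next > and ignores quoting, comments and every other special case.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Naive illustration: removes "<...>" runs and NUL bytes from s in place.
 * Returns the new length. Text between tags is kept. */
size_t sketch_strip_tags(char *s, size_t len)
{
    size_t out = 0;
    int in_tag = 0;

    assert(s != NULL);
    for (size_t i = 0; i < len; i++) {
        char c = s[i];

        if (in_tag) {
            if (c == '>') in_tag = 0;   /* tag ends, resume copying */
        } else if (c == '<') {
            in_tag = 1;                 /* tag starts, stop copying */
        } else if (c != '\0') {
            s[out++] = c;               /* ordinary byte outside a tag */
        }
    }
    s[out] = '\0';
    return out;
}
```

Run on <b>Hello!</b>, this leaves Hello!, so only the markup goes and the content stays.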

The code that follows handles internal string management and doesn't do any sanitization. Well, not exactly - passing the undocumented flag FILTER_FLAG_EMPTY_STRING_NULL makes the filter return NULL instead of an empty string when the sanitized string is empty, but that's not really all that useful. An example:

var_dump(filter_var("yo", FILTER_SANITIZE_STRING, FILTER_FLAG_EMPTY_STRING_NULL));
var_dump(filter_var("\0", FILTER_SANITIZE_STRING, FILTER_FLAG_EMPTY_STRING_NULL));
var_dump(filter_var("yo", FILTER_SANITIZE_STRING));
var_dump(filter_var("\0", FILTER_SANITIZE_STRING));

string(2) "yo"
NULL
string(2) "yo"
string(0) ""

There isn't much more going on, so the manual was fairly correct - to sum it up:

  • Always: strip HTML, XML and PHP tags, strip NUL bytes, and (unless FILTER_FLAG_NO_ENCODE_QUOTES is set) encode ' and " to &#39; and &#34;.
  • FILTER_FLAG_NO_ENCODE_QUOTES - Do not encode ' and ".
  • FILTER_FLAG_STRIP_LOW - Strip characters with ASCII value below 32.
  • FILTER_FLAG_STRIP_HIGH - Strip characters with ASCII value above 127.
  • FILTER_FLAG_STRIP_BACKTICK - Strip the backtick (`) character.
  • FILTER_FLAG_ENCODE_LOW - Encode characters with ASCII value below 32.
  • FILTER_FLAG_ENCODE_HIGH - Encode characters with ASCII value above 127 (the memset in the source above also covers 127 itself).
  • FILTER_FLAG_ENCODE_AMP - Encode the & character to &#38; (not &amp;).
  • FILTER_FLAG_EMPTY_STRING_NULL - Return NULL instead of empty strings.

I wasn't sure if "stripping tags" means just the < > characters, and if it preserves content between tags, e.g. the string "Hello!" from <b>Hello!</b>, so I decided to check. Here are the results, using PHP 7.1.5 (and Bash for the command line):

curl --data-urlencode 'my-input='\
'1. ASCII b/n 32 and 127: ABC abc 012 '\
'2. ASCII higher than 127: Çüé '\
'3. PHP tag: <?php $i = 0; ?> '\
'4. HTML tag: <script type="text/javascript">var i = 0;</script> '\
'5. Ampersand: & '\
'6. Backtick: ` '\
'7. Double quote: " '\
'8. Single quote: '"'" \
http://localhost/sanitize.php
    • sanitize.php: <?php echo filter_input(INPUT_POST,'my-input', FILTER_SANITIZE_STRING);
    • output: 1. ASCII b/n 32 and 127: ABC abc 012 2. ASCII higher than 127: Çüé 3. PHP tag: 4. HTML tag: var i = 0; 5. Ampersand: & 6. Backtick: ` 7. Double quote: &#34; 8. Single quote: &#39;
    • sanitize.php: <?php echo filter_input(INPUT_POST,'my-input', FILTER_SANITIZE_STRING, FILTER_FLAG_NO_ENCODE_QUOTES);
    • output: 1. ASCII b/n 32 and 127: ABC abc 012 2. ASCII higher than 127: Çüé 3. PHP tag: 4. HTML tag: var i = 0; 5. Ampersand: & 6. Backtick: ` 7. Double quote: " 8. Single quote: '
    • sanitize.php: <?php echo filter_input(INPUT_POST,'my-input', FILTER_SANITIZE_STRING, FILTER_FLAG_STRIP_HIGH);
    • output: 1. ASCII b/n 32 and 127: ABC abc 012 2. ASCII higher than 127: 3. PHP tag: 4. HTML tag: var i = 0; 5. Ampersand: & 6. Backtick: ` 7. Double quote: &#34; 8. Single quote: &#39;
    • sanitize.php: <?php echo filter_input(INPUT_POST,'my-input', FILTER_SANITIZE_STRING, FILTER_FLAG_STRIP_BACKTICK);
    • output: 1. ASCII b/n 32 and 127: ABC abc 012 2. ASCII higher than 127: Çüé 3. PHP tag: 4. HTML tag: var i = 0; 5. Ampersand: & 6. Backtick: 7. Double quote: &#34; 8. Single quote: &#39;
    • sanitize.php: <?php echo filter_input(INPUT_POST,'my-input', FILTER_SANITIZE_STRING, FILTER_FLAG_ENCODE_HIGH);
    • output: 1. ASCII b/n 32 and 127: ABC abc 012 2. ASCII higher than 127: &#195;&#135;&#195;&#188;&#195;&#169; 3. PHP tag: 4. HTML tag: var i = 0; 5. Ampersand: & 6. Backtick: ` 7. Double quote: &#34; 8. Single quote: &#39;
    • sanitize.php: <?php echo filter_input(INPUT_POST,'my-input', FILTER_SANITIZE_STRING, FILTER_FLAG_ENCODE_AMP);
    • output: 1. ASCII b/n 32 and 127: ABC abc 012 2. ASCII higher than 127: Çüé 3. PHP tag: 4. HTML tag: var i = 0; 5. Ampersand: &#38; 6. Backtick: ` 7. Double quote: &#34; 8. Single quote: &#39;
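
The FILTER_FLAG_ENCODE_HIGH output above has six entities for three characters because the encoding works on bytes, not on characters. Assuming the input arrived as UTF-8, each of Ç, ü and é is two bytes long. A small C sketch (sketch_encode_high is a name I made up, and the >= 127 boundary follows the memset in php_filter_string) reproduces the same numbers:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Encodes every byte >= 127 of src as "&#N;" and copies the rest as-is.
 * dst must hold at least 6 * strlen(src) + 1 bytes. */
void sketch_encode_high(const char *src, char *dst)
{
    assert(src != NULL && dst != NULL);
    for (; *src != '\0'; src++) {
        unsigned char c = (unsigned char)*src;

        if (c >= 127) {
            dst += sprintf(dst, "&#%u;", (unsigned)c);  /* one entity per byte */
        } else {
            *dst++ = (char)c;
        }
    }
    *dst = '\0';
}
```

Feeding it the UTF-8 bytes of Çüé (C3 87, C3 BC, C3 A9) yields &#195;&#135;&#195;&#188;&#195;&#169;, the same string as in the test output. A browser will not render those entities back as Çüé, since each one refers to a separate code point.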

Also, for the flags FILTER_FLAG_STRIP_LOW & FILTER_FLAG_ENCODE_LOW, since my Bash doesn't display these characters, I checked using the bell character (BEL, ASCII 7) and the Restman Chrome extension that:

  • without either of these flags, the character is preserved
  • with FILTER_FLAG_STRIP_LOW, it is removed
  • with FILTER_FLAG_ENCODE_LOW, it is encoded to &#7;