If I want to create a URL using a variable I have two choices to encode the string. urlencode()
and rawurlencode()
.
What exactly are the differences and which is preferred?
Encodes text and merge field values for use in URLs by replacing characters that are illegal in URLs, such as blank spaces, with the code that represent those characters as defined in RFC 3986, Uniform Resource Identifier (URI): Generic Syntax.
urlencode () is the function that can be used conveniently to encode a string before using in a query part of a URL. This is a convenient way for passing variables to the next page. urldecode() is the function that is used to decode the encoded string.
This is a type of encoding-decoding approach where the built-in PHP functions urlencode() and urldecode()are implemented to encode and decode the URL, respectively. This encoding will replace almost all the special characters other than (_), (-), and (.) in the given URL.
A space may only be encoded to "+" in the "application/x-www-form-urlencoded" content-type key-value pairs query part of an URL.
It will depend on your purpose. If interoperability with other systems is important then it seems rawurlencode is the way to go. The one exception is legacy systems which expect the query string to follow form-encoding style of spaces encoded as + instead of %20 (in which case you need urlencode).
rawurlencode follows RFC 1738 prior to PHP 5.3.0 and RFC 3986 afterwards (see http://us2.php.net/manual/en/function.rawurlencode.php)
Returns a string in which all non-alphanumeric characters except -_.~ have been replaced with a percent (%) sign followed by two hex digits. This is the encoding described in » RFC 3986 for protecting literal characters from being interpreted as special URL delimiters, and for protecting URLs from being mangled by transmission media with character conversions (like some email systems).
Note on RFC 3986 vs 1738. rawurlencode prior to php 5.3 encoded the tilde character (~
) according to RFC 1738. As of PHP 5.3, however, rawurlencode follows RFC 3986 which does not require encoding tilde characters.
urlencode encodes spaces as plus signs (not as %20
as done in rawurlencode)(see http://us2.php.net/manual/en/function.urlencode.php)
Returns a string in which all non-alphanumeric characters except -_. have been replaced with a percent (%) sign followed by two hex digits and spaces encoded as plus (+) signs. It is encoded the same way that the posted data from a WWW form is encoded, that is the same way as in application/x-www-form-urlencoded media type. This differs from the » RFC 3986 encoding (see rawurlencode()) in that for historical reasons, spaces are encoded as plus (+) signs.
This corresponds to the definition for application/x-www-form-urlencoded in RFC 1866.
Additional Reading:
You may also want to see the discussion at http://bytes.com/groups/php/5624-urlencode-vs-rawurlencode.
Also, RFC 2396 is worth a look. RFC 2396 defines valid URI syntax. The main part we're interested in is from 3.4 Query Component:
Within a query component, the characters
";", "/", "?", ":", "@",
are reserved.
"&", "=", "+", ",", and "$"
As you can see, the +
is a reserved character in the query string and thus would need to be encoded as per RFC 3986 (as in rawurlencode).
Proof is in the source code of PHP.
I'll take you through a quick process of how to find out this sort of thing on your own in the future any time you want. Bear with me, there'll be a lot of C source code you can skim over (I explain it). If you want to brush up on some C, a good place to start is our SO wiki.
Download the source (or use http://lxr.php.net/ to browse it online), grep all the files for the function name, you'll find something such as this:
PHP 5.3.6 (most recent at time of writing) describes the two functions in their native C code in the file url.c.
RawUrlEncode()
PHP_FUNCTION(rawurlencode) { char *in_str, *out_str; int in_str_len, out_str_len; if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "s", &in_str, &in_str_len) == FAILURE) { return; } out_str = php_raw_url_encode(in_str, in_str_len, &out_str_len); RETURN_STRINGL(out_str, out_str_len, 0); }
UrlEncode()
PHP_FUNCTION(urlencode) { char *in_str, *out_str; int in_str_len, out_str_len; if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "s", &in_str, &in_str_len) == FAILURE) { return; } out_str = php_url_encode(in_str, in_str_len, &out_str_len); RETURN_STRINGL(out_str, out_str_len, 0); }
Okay, so what's different here?
They both are in essence calling two different internal functions respectively: php_raw_url_encode and php_url_encode
So go look for those functions!
PHPAPI char *php_raw_url_encode(char const *s, int len, int *new_length) { register int x, y; unsigned char *str; str = (unsigned char *) safe_emalloc(3, len, 1); for (x = 0, y = 0; len--; x++, y++) { str[y] = (unsigned char) s[x]; #ifndef CHARSET_EBCDIC if ((str[y] < '0' && str[y] != '-' && str[y] != '.') || (str[y] < 'A' && str[y] > '9') || (str[y] > 'Z' && str[y] < 'a' && str[y] != '_') || (str[y] > 'z' && str[y] != '~')) { str[y++] = '%'; str[y++] = hexchars[(unsigned char) s[x] >> 4]; str[y] = hexchars[(unsigned char) s[x] & 15]; #else /*CHARSET_EBCDIC*/ if (!isalnum(str[y]) && strchr("_-.~", str[y]) != NULL) { str[y++] = '%'; str[y++] = hexchars[os_toascii[(unsigned char) s[x]] >> 4]; str[y] = hexchars[os_toascii[(unsigned char) s[x]] & 15]; #endif /*CHARSET_EBCDIC*/ } } str[y] = '\0'; if (new_length) { *new_length = y; } return ((char *) str); }
PHPAPI char *php_url_encode(char const *s, int len, int *new_length) { register unsigned char c; unsigned char *to, *start; unsigned char const *from, *end; from = (unsigned char *)s; end = (unsigned char *)s + len; start = to = (unsigned char *) safe_emalloc(3, len, 1); while (from < end) { c = *from++; if (c == ' ') { *to++ = '+'; #ifndef CHARSET_EBCDIC } else if ((c < '0' && c != '-' && c != '.') || (c < 'A' && c > '9') || (c > 'Z' && c < 'a' && c != '_') || (c > 'z')) { to[0] = '%'; to[1] = hexchars[c >> 4]; to[2] = hexchars[c & 15]; to += 3; #else /*CHARSET_EBCDIC*/ } else if (!isalnum(c) && strchr("_-.", c) == NULL) { /* Allow only alphanumeric chars and '_', '-', '.'; escape the rest */ to[0] = '%'; to[1] = hexchars[os_toascii[c] >> 4]; to[2] = hexchars[os_toascii[c] & 15]; to += 3; #endif /*CHARSET_EBCDIC*/ } else { *to++ = c; } } *to = 0; if (new_length) { *new_length = to - start; } return (char *) start; }
One quick bit of knowledge before I move forward, EBCDIC is another character set, similar to ASCII, but a total competitor. PHP attempts to deal with both. But basically, this means byte EBCDIC 0x4c byte isn't the L
in ASCII, it's actually a <
. I'm sure you see the confusion here.
Both of these functions manage EBCDIC if the web server has defined it.
Also, they both use an array of chars (think string type) hexchars
look-up to get some values, the array is described as such:
/* rfc1738: ...The characters ";", "/", "?", ":", "@", "=" and "&" are the characters which may be reserved for special meaning within a scheme... ...Thus, only alphanumerics, the special characters "$-_.+!*'(),", and reserved characters used for their reserved purposes may be used unencoded within a URL... For added safety, we only leave -_. unencoded. */ static unsigned char hexchars[] = "0123456789ABCDEF";
Beyond that, the functions are really different, and I'm going to explain them in ASCII and EBCDIC.
URLENCODE:
+
sign to the output string.isalnum(c)
), and also isn't and _
, -
, or .
character, then we , output a %
sign to array position 0, do an array look up to the hexchars
array for a lookup for os_toascii
array (an array from Apache that translates char to hex code) for the key of c
(the present character), we then bitwise shift right by 4, assign that value to the character 1, and to position 2 we assign the same lookup, except we preform a logical and to see if the value is 15 (0xF), and return a 1 in that case, or a 0 otherwise. At the end, you'll end up with something encoded._-.
chars, it outputs exactly what it is.RAWURLENCODE:
Note: Many programmers have probably never seen a for loop iterate this way, it's somewhat hackish and not the standard convention used with most for-loops, pay attention, it assigns x
and y
, checks for exit on len
reaching 0, and increments both x
and y
. I know, it's not what you'd expect, but it's valid code.
str
._-.
chars, and if it isn't, we do almost the same assignment as with URLENCODE where it preforms lookups, however, we increment differently, using y++
rather than to[1]
, this is because the strings are being built in different ways, but reach the same goal at the end anyway.\0
byte. Differences:
\0
byte to the string, RawUrlEncode does (this may be a moot point)They basically iterate differently, one assigns a + sign in the event of ASCII 20.
URLENCODE:
0
, with the exception of being a .
or -
, OR less than A
but greater than char 9
, OR greater than Z
and less than a
but not a _
. OR greater than z
(yeah, EBCDIC is kinda messed up to work with). If it matches any of those, do a similar lookup as found in the ASCII version (it just doesn't require a lookup in os_toascii).RAWURLENCODE:
z
, it excludes ~
from the URL encode.\0
byte to the string before return.~
that UrlEncode does not (this is a reported issue). It's worth noting that ASCII and EBCDIC 0x20 are both spaces.+
, RawUrlEncode makes a space into %20
via array lookups.Disclaimer: I haven't touched C in years, and I haven't looked at EBCDIC in a really really long time. If I'm wrong somewhere, let me know.
Based on all of this, rawurlencode is the way to go most of the time. As you see in Jonathan Fingland's answer, stick with it in most cases. It deals with the modern scheme for URI components, where as urlencode does things the old school way, where + meant "space."
If you're trying to convert between the old format and new formats, be sure that your code doesn't goof up and turn something that's a decoded + sign into a space by accidentally double-encoding, or similar "oops" scenarios around this space/20%/+ issue.
If you're working on an older system with older software that doesn't prefer the new format, stick with urlencode, however, I believe %20 will actually be backwards compatible, as under the old standard %20 worked, just wasn't preferred. Give it a shot if you're up for playing around, let us know how it worked out for you.
Basically, you should stick with raw, unless your EBCDIC system really hates you. Most programmers will never run into EBCDIC on any system made after the year 2000, maybe even 1990 (that's pushing, but still likely in my opinion).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With