Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

urlencode() the 'asterisk' (star?) character

I'm testing PHP urlencode() vs. Java java.net.URLEncoder.encode().

Java

String all = "";
for (int i = 32; i < 256; ++i) {
    all += (char) i;
}

System.out.println("All characters:         -||" + all + "||-");
try {
    System.out.println("Encoded characters:     -||" + URLEncoder.encode(all, "utf8") + "||-");
} catch (UnsupportedEncodingException e) {
    e.printStackTrace();
}

PHP

$all = "";
for($i = 32; $i < 256; ++$i)
{
    $all = $all.chr($i);
}

echo($all.PHP_EOL);
echo(urlencode(utf8_encode($all)).PHP_EOL);

All characters seem to be encoded in the same way with both functions, except for the 'asterisk' character that is not encoded by Java, and translated to %2A by PHP. Which behaviour is supposed to be the 'right' one, if any?

Note: I tried with rawurlencode(), too - no luck.

like image 734
etienne Avatar asked Jun 30 '11 10:06

etienne


People also ask

Is * a valid URL character?

There are only certain characters that are allowed in the URL string, alphabetic characters, numerals, and a few characters ; , / ? : @ & = + $ - _ . ! ~ * ' ( ) # that can have special meanings.

What is Urlencode function in PHP?

The urlencode() function is an inbuilt function in PHP which is used to encode the url. This function returns a string which consist all non-alphanumeric characters except -_. and replace by the percent (%) sign followed by two hex digits and spaces encoded as plus (+) signs. Syntax: string urlencode( $input )

What is Urlencode in C#?

UrlEncode(String) Method (System.Net) Converts a text string into a URL-encoded string.

What does %20 replace in URL?

URLs cannot contain spaces. URL encoding normally replaces a space with a plus (+) sign or with %20.


2 Answers

It is okay to have a * in a URL, (but it is also okay to have it in its encoded form).

RFC1738: Uniform Resource Locators (URL) states the following:

Reserved:

[...]

Usually a URL has the same interpretation when an octet is represented by a character and when it encoded. However, this is not true for reserved characters: encoding a character reserved for a particular scheme may change the semantics of a URL.

Thus, only alphanumerics, the special characters "$-_.+!*'(),", and reserved characters used for their reserved purposes may be used unencoded within a URL.

On the other hand, characters that are not required to be encoded (including alphanumerics) may be encoded within the scheme-specific part of a URL, as long as they are not being used for a reserved purpose.

like image 125
aioobe Avatar answered Oct 05 '22 22:10

aioobe


Wikipedia suggests that * is a reserved character when it comes to URIs, and that it must be encoded if not used for the reserved purpose. According to RFC3986, pages 12-13:

URIs include components and subcomponents that are delimited by characters in the "reserved" set. These characters are called "reserved" because they may (or may not) be defined as delimiters by the generic syntax, by each scheme-specific syntax, or by the implementation-specific syntax of a URI's dereferencing algorithm. If data for a URI component would conflict with a reserved character's purpose as a delimiter, then the conflicting data must be percent-encoded before the URI is formed.

  reserved    = gen-delims / sub-delims

  gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"

  sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
              / "*" / "+" / "," / ";" / "="

(The fact that the URL RFC still allows the * character to go unencoded is that is doesn't have a reserved purpose i URLs, and as such doesn't have to be encoded. So wether you have to encode it or not depends on what sort of URI you're creating.)

like image 29
You Avatar answered Oct 05 '22 23:10

You