
Why does URL encoding exist for the ASCII character set?

It is clearly stated in W3Schools that

URLs can only be sent over the Internet using the ASCII character-set.

Why does URL encoding exist for ASCII characters like a, b, c when they can be sent over the Internet without any URL encoding?

E.g.: why encode 'a' when it can be sent as 'a'?

What are the possible reasons to encode ASCII characters? The only reason I can think of is attackers trying to make their URLs as unreadable as possible in order to carry out XSS attacks.

Computernerd asked Dec 31 '13 08:12


People also ask

What is the purpose of URL encoding?

URL encoding converts characters into a format that can be transmitted over the Internet. URLs can only be sent over the Internet using the ASCII character-set. Since URLs often contain characters outside the ASCII set, the URL has to be converted into a valid ASCII format.

What is the purpose of the ASCII character set?

ASCII stands for American Standard Code for Information Interchange. The purpose of ASCII is to provide a standard character encoding for electronic equipment. The standard ensures that different devices (which might be made by different manufacturers) can communicate with each other using the same character codes.

What is the problem with the ASCII character set?

Limitation of ASCII: the 128-character limit of ASCII (or 256 for the 8-bit "extended ASCII" code pages) restricts which characters can be represented. Covering the writing systems of many different languages is not possible in ASCII; there are simply not enough code points.

Can URL have non-ascii characters?

When generating a URL, only ASCII characters may be used. An example of a non-ASCII character is Ñ. A URL cannot contain a raw non-ASCII character or even a literal space; such characters have to be percent-encoded.
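As a rough illustration of that conversion, here is a minimal sketch using Python's standard `urllib.parse` module. The string `"El Niño"` is just a made-up example value, and the sketch relies on `quote()` encoding text as UTF-8, which is its default.

```python
from urllib.parse import quote, unquote

# Hypothetical path segment containing a space and a non-ASCII letter.
segment = "El Niño"

# quote() encodes the text as UTF-8 by default and percent-encodes every
# byte that is not allowed to appear literally.
encoded = quote(segment)
print(encoded)           # El%20Ni%C3%B1o

# Decoding reverses the process and recovers the original text.
print(unquote(encoded))  # El Niño
```

The result is pure ASCII, so it can be placed in a URL as-is.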


2 Answers

STD 66, Percent-Encoding:

A percent-encoding mechanism is used to represent a data octet in a component when that octet's corresponding character is outside the allowed set or is being used as a delimiter of, or within, the component.

So percent-encoding is a kind of escape mechanism: Some characters have a special meaning in URI components (→ they are reserved). If you want to use such a character without its special meaning, you percent-encode it.
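To make the "escape mechanism" idea concrete, here is a small sketch in Python using the standard `urllib.parse` module. The parameter name `q` and the value `a&b=c` are invented for the example; the point is only that `&` and `=` act as delimiters in a query string unless they are percent-encoded.

```python
from urllib.parse import parse_qs, quote

# Hypothetical value that happens to contain the reserved characters & and =.
value = "a&b=c"

# Unescaped, the & and = are read as delimiters and the value is split up.
raw_query = "q=" + value
print(parse_qs(raw_query))      # {'q': ['a'], 'b': ['c']}

# Percent-encoded, the same characters are treated as plain data.
escaped_query = "q=" + quote(value, safe="")
print(escaped_query)            # q=a%26b%3Dc
print(parse_qs(escaped_query))  # {'q': ['a&b=c']}
```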

Unreserved characters (like a, b, c, …) can always be used directly, but it’s also allowed to percent-encode them. Such URIs would be equivalent:

URIs that differ in the replacement of an unreserved character with its corresponding percent-encoded US-ASCII octet are equivalent: they identify the same resource.
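One way to see this equivalence is to normalize a URI by decoding only those percent-encoded octets that correspond to unreserved characters. The helper below, `decode_unreserved`, is a hypothetical name for a sketch of that idea, not a standard library function, and the example.com URIs are made up.

```python
import re
import string

# Unreserved characters per STD 66: letters, digits, "-", ".", "_", "~".
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

def decode_unreserved(uri: str) -> str:
    """Decode %XX triplets only when they stand for an unreserved character."""
    def repl(match: re.Match) -> str:
        char = chr(int(match.group(1), 16))
        # Leave every other triplet encoded, only uppercasing its hex digits.
        return char if char in UNRESERVED else match.group(0).upper()
    return re.sub(r"%([0-9A-Fa-f]{2})", repl, uri)

# "%61%62" is just "ab", so both spellings normalize to the same URI.
print(decode_unreserved("http://example.com/%61%62c"))  # http://example.com/abc

# "%2F" stands for the reserved "/", so it has to stay encoded.
print(decode_unreserved("http://example.com/a%2fb"))    # http://example.com/a%2Fb
```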

Why is it allowed to percent-encode unreserved characters in the first place? The obsolete RFC 2396 says (bold by me):

Unreserved characters can be escaped without changing the semantics of the URI, but this should not be done unless the URI is being used in a context that does not allow the unescaped character to appear.

I can’t think of an example for such a "context", but this sentence suggests that there may be some.

Also, maybe some people/implementations like to simply percent-encode everything (except for delimiters etc.), so they don’t have to check if/which characters would need percent-encoding in the corresponding component.
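A sketch of what such an "encode everything" implementation might look like in Python. `encode_all` is a hypothetical helper written for this illustration, and it assumes the text should be encoded as UTF-8 first.

```python
from urllib.parse import unquote

def encode_all(text: str) -> str:
    """Percent-encode every byte, without checking whether it needs escaping."""
    return "".join(f"%{byte:02X}" for byte in text.encode("utf-8"))

encoded = encode_all("abc")
print(encoded)           # %61%62%63

# A standard decoder still recovers the original text.
print(unquote(encoded))  # abc
```

The encoder needs no table of reserved characters for each URI component, at the cost of longer and less readable URIs.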

unor answered Nov 11 '22 04:11


URL encoding exists for the full range of ASCII because it was easier to define an encoding that works for all characters than to define one that only works for the set of characters with special meanings.

Mark answered Nov 11 '22 04:11