
When, if ever, should characters like { and } (curly braces) be percent-encoded in URLs?

According to RFC 3986, the following characters are reserved and must be percent-encoded when they appear in a URI for anything other than their reserved purpose: :/?#[]@!$&'()*+,;=

Furthermore it specifies some characters that are specifically unreserved: a-zA-Z0-9\-._~

It seems clear that generally one should encode reserved characters (to prevent misinterpretation) and not encode unreserved characters (for readability), but how should characters that do not fall into either category be handled? For example { and } do not appear in either list, but they are standard ASCII characters.
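To make the three categories concrete, here is a minimal sketch that expresses the two RFC 3986 lists as sets and reports which one a character falls in. The function name classify and the label "neither" are made up for illustration; they are not RFC terminology.

```python
import string

# Character lists from RFC 3986 sections 2.2 (reserved) and 2.3 (unreserved).
RESERVED = set(":/?#[]@!$&'()*+,;=")
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")


def classify(ch: str) -> str:
    """Return 'reserved', 'unreserved', or 'neither' for a single character."""
    if ch in RESERVED:
        return "reserved"
    if ch in UNRESERVED:
        return "unreserved"
    # Hypothetical label for characters such as { and } that appear in neither list.
    return "neither"


for ch in "?~{}":
    print(repr(ch), classify(ch))
```

Running it shows ? as reserved, ~ as unreserved, and both curly braces as neither.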

Looking to modern browsers for guidance, it seems they sometimes have different behaviors. For example, consider pasting the URL https://www.google.com/search?q={ into the address bar of a web browser:

  • Chrome 34.0.1847.116 m does not change it.
  • Firefox 28.0 does not change it.
  • Internet Explorer 9.0 does not change it.
  • Safari 5.1.7 changes it to https://www.google.com/search?q=%7B

However, if one pastes https://www.google.com/#q={ (removing "search" and changing the ? to a #, making the character part of the fragment/hash rather than the query string) we find that:

  • Chrome 34.0.1847.116 m changes it to https://www.google.com/#q=%7B (via JavaScript)
  • Firefox 28.0 does not change it.
  • Internet Explorer 9.0 does not change it.
  • Safari 5.1.7 changes it to https://www.google.com/#q=%7B (before executing JavaScript)

Furthermore, when using JavaScript to perform the request asynchronously (e.g. using this MDN example modified to use a URL of ?q={), the URL is not percent-encoded automatically. (I'm guessing this is because the XMLHttpRequest API assumes that the URL has already been encoded/escaped.)
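If the caller is expected to encode the URL beforehand, the query string would be built along these lines. This is only a sketch in Python rather than browser JavaScript; build_url is a hypothetical helper, and urlencode is the standard-library function from urllib.parse.

```python
from urllib.parse import urlencode


def build_url(base: str, params: dict) -> str:
    """Append percent-encoded query parameters to a base URL (hypothetical helper)."""
    # urlencode percent-encodes each key and value, so "{" becomes "%7B".
    return f"{base}?{urlencode(params)}"


print(build_url("https://www.google.com/search", {"q": "{"}))
# https://www.google.com/search?q=%7B
```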

I would like to (for a reason related to a bizarre customer requirement) use { and } in the filename portion of URLs without (1) breaking things and ideally also without (2) creating ugly-looking percent-encoded entries in the network panel of modern browsers' web inspectors/debuggers.
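For requirement (1), a minimal sketch of what encoding such a filename as a single path segment could look like, assuming the server decodes the segment before looking up the file. The filename and both helper names are hypothetical. This does nothing for requirement (2), since the encoded form is exactly what a network panel may display.

```python
from urllib.parse import quote, unquote


def encode_segment(name: str) -> str:
    """Percent-encode one path segment; safe="" means "/" is encoded too."""
    return quote(name, safe="")


def decode_segment(segment: str) -> str:
    """Reverse encode_segment, as a server would before using the name."""
    return unquote(segment)


filename = "report{2014}.pdf"  # hypothetical filename containing curly braces
encoded = encode_segment(filename)
print(encoded)  # report%7B2014%7D.pdf
print(decode_segment(encoded) == filename)  # True
```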

asked Apr 14 '14 15:04 by jacobq


1 Answer

(RFC 2396, since obsoleted by RFC 3986)

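You should percent-encode every character in the RFC's "unwise" set. The RFC gives the reason: gateways and other transport agents are known to sometimes modify such characters, or they are used as delimiters.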


Additional information from the RFC:

Also account for the delimiters < > # % " and for the control characters 00-1F and 7F, all of which the RFC excludes.

The RFC marks these characters as unwise: { } | \ ^ [ ] `
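The excluded sets can be sketched as follows, based on RFC 2396 section 2.4.3, where the double quote is grouped with the delimiters and the space is excluded as well. The helper encode_excluded is hypothetical; it handles only these ASCII characters and encodes a literal % too, so it should not be applied to text that is already percent-encoded.

```python
# Excluded US-ASCII characters per RFC 2396 section 2.4.3.
CONTROL = {chr(c) for c in range(0x00, 0x20)} | {chr(0x7F)}
SPACE = {" "}
DELIMS = set('<>#%"')
UNWISE = set("{}|\\^[]`")
EXCLUDED = CONTROL | SPACE | DELIMS | UNWISE


def encode_excluded(text: str) -> str:
    """Percent-encode only the excluded characters and leave everything else as-is."""
    return "".join(f"%{ord(ch):02X}" if ch in EXCLUDED else ch for ch in text)


print(encode_excluded("a{b}|c d"))  # a%7Bb%7D%7Cc%20d
```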

If you intend to allow # in query string values, that is a special case: # introduces the fragment identifier of a URI, so a literal # inside a value must be encoded as %23.

Some characters do not have to be encoded and are accepted either encoded or unencoded, such as ~.

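There are two generally accepted encodings for a space: %20 and +. The + form applies only to form-encoded (application/x-www-form-urlencoded) query strings; elsewhere in a URL a + is a literal plus sign.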
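The difference between the two space encodings can be seen with Python's standard urllib.parse functions; this is only an illustration, and other languages name the equivalent functions differently.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

print(quote("a b"))         # a%20b  (generic percent-encoding)
print(quote_plus("a b"))    # a+b    (form-style encoding for query strings)
print(unquote("a+b"))       # a+b    (a plain decode leaves + alone)
print(unquote_plus("a+b"))  # a b    (a form-style decode turns + into a space)
print(quote_plus("a+b"))    # a%2Bb  (a literal + must itself be encoded)
```

Because the two decoders disagree about +, both ends of a request need to agree on which convention is in use.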

Here's a fiddle with some of the test cases I'm using.

answered Sep 22 '22 22:09 by Maslow