
RFC3986 - which pchars need to be percent-encoded?

I need to generate an href pointing to a URI. This is easy except when it comes to reserved characters that need percent-encoding, e.g. a link to /some/path;element should appear as <a href="/some/path%3Belement"> (I know that path;element represents a single entity).

Initially I was looking for a Java library that does this but I ended up writing something myself (look below for what failed with Java, as this question isn't Java-specific).

So, RFC 3986 does say when NOT to encode. As I read it, this applies when a character falls in the unreserved class (ALPHA / DIGIT / "-" / "." / "_" / "~"). So far so good. But what about the opposite case? The RFC only mentions that percent (%) always needs encoding. But what about the others?

Question: is it correct to assume that everything that is not unreserved can/should be percent-encoded? For example, an opening bracket ( does not necessarily need encoding but a semicolon ; does. If I don't encode it I end up looking for /first* when following <a href="/first;second">. But following <a href="/first(second"> I always end up looking for /first(second, as expected. What confuses me is that both ( and ; are in the same sub-delims class as far as the RFC goes. As I imagine, encoding everything non-unreserved is a safe bet, but what about SEOability and user friendliness when it comes to localized URIs?

Now, what failed with Java libs. I have tried doing it like
new java.net.URI("http", "site", "/pa;th", null).toASCIIString()
but this gives http://site/pa;th which is no good. Similar results observed with:

  • javax.ws.rs.core.UriBuilder
  • Spring's UriUtils - I have tried both encodePath(String, String) and encodePathSegment(String, String)

[*] /first is a result of call to HttpServletRequest.getServletPath() in the server side when clicking on <a href="/first;second">

EDIT: I probably need to mention that this behaviour was observed under Tomcat, and I have checked both Tomcat 6 and 7 behave the same way.

mindas asked May 06 '11 15:05




2 Answers

Is it correct to assume that everything that is not unreserved, can/should be percent-encoded?

No. RFC 3986 says this:

"Under normal circumstances, the only time when octets within a URI are percent-encoded is during the process of producing the URI from its component parts. This is when an implementation determines which of the reserved characters are to be used as subcomponent delimiters and which can be safely used as data. "

The implication is that you decide which of the delimiters (i.e. the <reserved> characters) need to be encoded depending on the context. Those which don't need to be encoded shouldn't be encoded.

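For instance, you should not percent-encode a / that separates path segments, but you should percent-encode a / that is meant as data inside a single path segment. (In a query or fragment, RFC 3986 allows a literal /, although some producers encode it there anyway.)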

So, in fact, a ; character (which is a member of <reserved>) should not be automatically percent-encoded. And indeed the Java URL and URI classes won't do this; see the URI(...) javadoc, specifically step 7, for how the <path> component is handled.
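To make that concrete, here is a minimal sketch using the same four-argument java.net.URI constructor as in the question. The class name UriSemicolonDemo and the helper build are made up for illustration; the host "site" is taken from the question. It is only meant to show that the constructor quotes characters that are illegal in a path (and always quotes %), while leaving ; alone.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class UriSemicolonDemo {
    // Hypothetical helper: builds http://site<path> with the multi-argument
    // URI constructor, which quotes illegal path characters but not ';'.
    public static String build(String path) {
        try {
            return new URI("http", "site", path, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(build("/pa;th"));  // http://site/pa;th    (';' is legal in a path)
        System.out.println(build("/pa th"));  // http://site/pa%20th  (space is not legal)
        System.out.println(build("/pa%th"));  // http://site/pa%25th  ('%' is always quoted)
    }
}
```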

This is reinforced by this paragraph:

"The purpose of reserved characters is to provide a set of delimiting characters that are distinguishable from other data within a URI. URIs that differ in the replacement of a reserved character with its corresponding percent-encoded octet are not equivalent. Percent-encoding a reserved character, or decoding a percent-encoded octet that corresponds to a reserved character, will change how the URI is interpreted by most applications. Thus, characters in the reserved set are protected from normalization and are therefore safe to be used by scheme-specific and producer-specific algorithms for delimiting data subcomponents within a URI."

So this says that a URL containing a percent-encoded ; is not the same as a URL that contains a raw ;. And the last sentence implies that they should NOT be percent encoded or decoded automatically.
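The non-equivalence can be observed with java.net.URI itself. This is just a small sketch (the class name RawVersusEncoded is invented for the example): equals() compares the raw forms, so the two URIs differ, even though getPath() decodes both to the same string.

```java
import java.net.URI;

public class RawVersusEncoded {
    public static final URI RAW = URI.create("http://site/pa;th");
    public static final URI ENCODED = URI.create("http://site/pa%3Bth");

    public static void main(String[] args) {
        // URI.equals compares raw components, so these are different URIs.
        System.out.println(RAW.equals(ENCODED));   // false
        // The raw paths keep the distinction ...
        System.out.println(RAW.getRawPath());      // /pa;th
        System.out.println(ENCODED.getRawPath());  // /pa%3Bth
        // ... while the decoded path loses it.
        System.out.println(ENCODED.getPath());     // /pa;th
    }
}
```

Note that once you call getPath() you can no longer tell which ; was a delimiter and which was data, which is one reason to keep working with the raw form.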


Which leaves us with the question - why do you want ; to be percent encoded?

Let's say you have a CMS where people can create arbitrary pages having arbitrary paths. Later on, I need to generate href links to all pages in, for example, a site map component. Therefore I need an algorithm to know which characters to escape. The semicolon has to be treated literally in this case and should be escaped.

Sorry, but it does not follow that semicolon should be escaped.

As far as the URL / URI spec is concerned, the ; has no special meaning. It might have special meaning to a particular web server / web site, but in general (i.e. without specific knowledge of the site) you have no way of knowing this.

  • If the ; does have special meaning in a particular URI, then percent-escaping it breaks that meaning. For instance, if the site uses ; to allow a session token to be appended to the path, then percent-encoding will stop it from recognizing the session token ...

  • If the ; is simply a data character provided by some client, then by percent-encoding it you are potentially changing the meaning of the URI. Whether this matters depends on what the server does; i.e. whether it decodes or not as part of the application logic.
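As an illustration of the first case, here is a rough sketch of how a server might treat everything after a ; in a path segment as a parameter and drop it before matching the path. This is NOT Tomcat's actual code; the class PathParams and the method stripPathParameters are hypothetical, and real containers have more rules. It only mimics the behaviour reported in the question, where /first;second was looked up as /first.

```java
public class PathParams {
    // Hypothetical helper: removes ";params" from every segment of a raw
    // (still percent-encoded) path. A percent-encoded %3B is left untouched.
    public static String stripPathParameters(String rawPath) {
        String[] segments = rawPath.split("/", -1); // -1 keeps trailing empty segments
        for (int i = 0; i < segments.length; i++) {
            int semi = segments[i].indexOf(';');
            if (semi >= 0) {
                segments[i] = segments[i].substring(0, semi);
            }
        }
        return String.join("/", segments);
    }

    public static void main(String[] args) {
        System.out.println(stripPathParameters("/first;second"));           // /first
        System.out.println(stripPathParameters("/first%3Bsecond"));         // /first%3Bsecond
        System.out.println(stripPathParameters("/first(second"));           // /first(second
        System.out.println(stripPathParameters("/a/b;jsessionid=abc123"));  // /a/b
    }
}
```

With a server that behaves like this, a raw ; and an encoded %3B lead to different resources, so whoever generates the link has to know which one was intended.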

What this means is that knowing the "right thing to do" requires intimate knowledge of what the URI means to the end user and/or the site. This would require advanced mind-reading technology to implement. My recommendation would be to get the CMS to solve it by suitably escaping any delimiters in the URI paths before it delivers them to your software. The algorithm is necessarily going to be specific to the CMS and content delivery platform. It/they will be responding to requests for documents identified by the URLs and will need to know how to interpret them.

(Supporting arbitrary people using arbitrary paths is a bit crazy. There have to be some limits. For instance, not even Windows allows you to use a file separator character in a filename component. So you are going to have to have some boundaries somewhere. It is just a matter of deciding where they should be.)

Stephen C answered Oct 10 '22 08:10



The ABNF for an absolute path part:

 path-absolute = "/" [ segment-nz *( "/" segment ) ]
 segment       = *pchar
 segment-nz    = 1*pchar
 pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"
 pct-encoded   = "%" HEXDIG HEXDIG
 unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
 reserved      = gen-delims / sub-delims
 sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
               / "*" / "+" / "," / ";" / "="

pchar includes sub-delims so you would not have to encode any of these in the path part: :@-._~!$&'()*+,;=
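For what it's worth, a minimal sketch of a segment encoder built directly on that ABNF might look like the following. This is not the URL builder mentioned below; the class PathSegmentEncoder and its alsoEncode parameter are invented for this example. It leaves unreserved characters, sub-delims, ":" and "@" as they are and percent-encodes every other UTF-8 byte. The alsoEncode argument is an assumption added so the caller can force extra characters (such as ;) to be escaped when the target server treats them specially.

```java
import java.nio.charset.StandardCharsets;

public class PathSegmentEncoder {
    private static final String UNRESERVED_PUNCT = "-._~";
    // sub-delims plus ":" and "@", i.e. the rest of pchar (minus pct-encoded)
    private static final String OTHER_PCHAR = "!$&'()*+,;=:@";

    private static boolean isUnreserved(int b) {
        return (b >= 'A' && b <= 'Z') || (b >= 'a' && b <= 'z')
                || (b >= '0' && b <= '9') || UNRESERVED_PUNCT.indexOf(b) >= 0;
    }

    // Encodes ONE path segment. Characters listed in alsoEncode are escaped
    // even though the grammar would allow them literally.
    public static String encode(String segment, String alsoEncode) {
        StringBuilder out = new StringBuilder();
        for (byte raw : segment.getBytes(StandardCharsets.UTF_8)) {
            int b = raw & 0xFF;
            boolean literal = isUnreserved(b)
                    || (OTHER_PCHAR.indexOf(b) >= 0 && alsoEncode.indexOf(b) < 0);
            if (literal) {
                out.append((char) b);
            } else {
                out.append('%')
                   .append(Character.toUpperCase(Character.forDigit(b >> 4, 16)))
                   .append(Character.toUpperCase(Character.forDigit(b & 0xF, 16)));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("first;second", ""));   // first;second
        System.out.println(encode("first;second", ";"));  // first%3Bsecond
        System.out.println(encode("a/b c", ""));          // a%2Fb%20c
    }
}
```

Since a / inside the input is escaped, this has to be called once per segment, with the segments joined by / afterwards.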

I wrote my own URL builder which includes an encoder for the path - as always, caveat emptor.

McDowell answered Oct 10 '22 06:10
