RFC3986 - which pchars need to be percent-encoded?

Tags:

java

language-agnostic

rfc3986

rfc

I need to generate an href for a URI. This is easy except when it comes to reserved characters that need percent-encoding, e.g. a link to /some/path;element should appear as <a href="/some/path%3Belement"> (I know that path;element represents a single entity).

Initially I was looking for a Java library that does this but I ended up writing something myself (look below for what failed with Java, as this question isn't Java-specific).

So, RFC 3986 does suggest when NOT to encode. This should happen, as I read it, when a character falls in the unreserved (ALPHA / DIGIT / "-" / "." / "_" / "~") class. So far so good. But what about the opposite case? The RFC only mentions that percent (%) always needs encoding. But what about the others?

Question: is it correct to assume that everything that is not unreserved can/should be percent-encoded? For example, an opening bracket ( does not necessarily need encoding but a semicolon ; does. If I don't encode it, I end up looking for /first* when following <a href="/first;second">. But following <a href="/first(second"> I always end up looking for /first(second, as expected. What confuses me is that both ( and ; are in the same sub-delims class as far as the RFC goes. As I imagine, encoding everything non-unreserved is a safe bet, but what about SEO and user friendliness when it comes to localized URIs?

Now, what failed with the Java libraries. I tried
new java.net.URI("http", "site", "/pa;th", null).toASCIIString()
but this gives http://site/pa;th, which is no good. Similar results were observed with:

javax.ws.rs.core.UriBuilder
Spring's UriUtils - I have tried both encodePath(String, String) and encodePathSegment(String, String)

[*] /first is the result of a call to HttpServletRequest.getServletPath() on the server side when clicking on <a href="/first;second">

EDIT: I probably need to mention that this behaviour was observed under Tomcat, and I have checked that both Tomcat 6 and 7 behave the same way.

asked May 06 '11 15:05

mindas

2 Answers

"Is it correct to assume that everything that is not unreserved, can/should be percent-encoded?"

No. RFC 3986 says this:

"Under normal circumstances, the only time when octets within a URI are percent-encoded is during the process of producing the URI from its component parts. This is when an implementation determines which of the reserved characters are to be used as subcomponent delimiters and which can be safely used as data. "

The implication is that you decide which of the delimiters (i.e. the <reserved> characters) need to be encoded depending on the context. Those which don't need to be encoded shouldn't be encoded.

For instance, you should not percent-encode a / when it separates path segments, but you should percent-encode it when it appears as data inside a single path segment. (In a query or fragment, RFC 3986 allows / to appear unencoded.)

So, in fact, a ; character (which is a member of <reserved>) should not be automatically percent-encoded. And indeed the Java URL and URI classes won't do this; see the URI(...) javadoc, specifically step 7, for how the <path> component is handled.
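To see this behaviour for yourself, here is a small sketch that uses only java.net.URI. The class name UriQuotingDemo and the helper build are made up for illustration; the host "site" is taken from the question. It suggests that the multi-argument constructor quotes characters that are illegal in a path (such as a space) but leaves ; alone.

```java
// Illustrative only: UriQuotingDemo and build(...) are hypothetical names.
final class UriQuotingDemo {

    // Builds http://site<path> with the four-argument constructor
    // URI(scheme, host, path, fragment) and returns its ASCII form.
    static String build(String path) {
        try {
            return new java.net.URI("http", "site", path, null).toASCIIString();
        } catch (java.net.URISyntaxException e) {
            throw new IllegalArgumentException("Cannot build a URI from path: " + path, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(build("/pa;th")); // the semicolon is legal in a path, so it stays
        System.out.println(build("/pa th")); // the space is not legal, so it gets quoted
    }
}
```

So the library is not being lazy about ; here. It simply has no way of knowing whether you meant it as data or as a delimiter.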

This is reinforced by this paragraph:

"The purpose of reserved characters is to provide a set of delimiting characters that are distinguishable from other data within a URI. URIs that differ in the replacement of a reserved character with its corresponding percent-encoded octet are not equivalent. Percent-encoding a reserved character, or decoding a percent-encoded octet that corresponds to a reserved character, will change how the URI is interpreted by most applications. Thus, characters in the reserved set are protected from normalization and are therefore safe to be used by scheme-specific and producer-specific algorithms for delimiting data subcomponents within a URI."

So this says that a URL containing a percent-encoded ; is not the same as a URL that contains a raw ;. And the last sentence implies that they should NOT be percent encoded or decoded automatically.
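As a rough illustration of that non-equivalence, the sketch below compares the two forms with java.net.URI. The class and method names (ReservedEquivalenceDemo, sameUri, rawPath, decodedPath) are invented for this example. java.net.URI compares the raw (still-encoded) components, so the two URIs are reported as different even though their decoded paths look identical.

```java
// Illustrative only: all names in this class are hypothetical.
final class ReservedEquivalenceDemo {

    // True if java.net.URI considers the two URI strings equal.
    static boolean sameUri(String a, String b) {
        return java.net.URI.create(a).equals(java.net.URI.create(b));
    }

    // The path exactly as written, with percent-escapes preserved.
    static String rawPath(String uri) {
        return java.net.URI.create(uri).getRawPath();
    }

    // The path with percent-escapes decoded.
    static String decodedPath(String uri) {
        return java.net.URI.create(uri).getPath();
    }

    public static void main(String[] args) {
        String raw = "http://site/pa;th";
        String encoded = "http://site/pa%3Bth";
        System.out.println(sameUri(raw, encoded));  // compares the raw forms
        System.out.println(rawPath(encoded));       // escape is kept
        System.out.println(decodedPath(encoded));   // escape is decoded
    }
}
```

Once the path has been decoded, the distinction between "; as data" and "; as delimiter" is gone, which is why the decision has to be made when the URI is produced.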

Which leaves us with the question - why do you want ; to be percent encoded?

"Let's say you have a CMS where people can create arbitrary pages having arbitrary paths. Later on, I need to generate href links to all pages in, for example, site map component. Therefore I need an algorithm to know which characters to escape. Semicolon has to be treated literally in this case and should be escaped."

Sorry, but it does not follow that semicolon should be escaped.

As far as the URL / URI spec is concerned, the ; has no special meaning. It might have special meaning to a particular web server / web site, but in general (i.e. without specific knowledge of the site) you have no way of knowing this.

If the ; does have special meaning in a particular URI, then percent-escaping it breaks that meaning. For instance, if the site uses ; to allow a session token to be appended to the path, then percent-encoding will stop it from recognizing the session token ...
If the ; is simply a data character provided by some client, then by percent-encoding it you are potentially changing the meaning of the URI. Whether this matters depends on what the server does; i.e. whether it decodes or not as part of the application logic.

What this means is that knowing the "right thing to do" requires intimate knowledge of what the URI means to the end user and/or the site. This would require advanced mind-reading technology to implement. My recommendation would be to get the CMS to solve it by suitably escaping any delimiters in the URI paths before it delivers them to your software. The algorithm is necessarily going to be specific to the CMS and content delivery platform. It/they will be responding to requests for documents identified by the URLs and will need to know how to interpret them.

(Supporting arbitrary people using arbitrary paths is a bit crazy. There have to be some limits. For instance, not even Windows allows you to use a file separator character in a filename component. So you are going to have to have some boundaries somewhere. It is just a matter of deciding where they should be.)

answered Oct 10 '22 08:10

Stephen C

The ABNF for an absolute path part:

 path-absolute = "/" [ segment-nz *( "/" segment ) ]
 segment       = *pchar
 segment-nz    = 1*pchar
 pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"
 pct-encoded   = "%" HEXDIG HEXDIG
 unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
 reserved      = gen-delims / sub-delims
 sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
               / "*" / "+" / "," / ";" / "="

pchar includes sub-delims so you would not have to encode any of these in the path part: :@-._~!$&'()*+,;=

I wrote my own URL builder which includes an encoder for the path - as always, caveat emptor.
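For illustration, here is a minimal sketch of what a segment encoder based on the pchar rule could look like. It is not the URL builder mentioned above. PathSegmentEncoder, encodeSegment, joinPath and the alsoEncode parameter are all names invented for this example, and UTF-8 is assumed for non-ASCII characters. By default it leaves every pchar alone; alsoEncode lets the caller say which otherwise-legal characters (for example ;) are data in their application and must be escaped anyway.

```java
// Illustrative only: all names in this class are hypothetical.
final class PathSegmentEncoder {

    // Characters that pchar allows literally in addition to ALPHA / DIGIT:
    // the rest of unreserved, the sub-delims, ":" and "@".
    private static final String PCHAR_EXTRA = "-._~!$&'()*+,;=:@";
    private static final char[] HEX = "0123456789ABCDEF".toCharArray();

    // Percent-encodes one path segment. Every octet that is not a pchar is
    // escaped, and so is every character listed in alsoEncode (pass "" for none).
    static String encodeSegment(String segment, String alsoEncode) {
        StringBuilder out = new StringBuilder();
        for (byte b : segment.getBytes(java.nio.charset.StandardCharsets.UTF_8)) {
            int c = b & 0xFF;
            boolean alnum = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
                    || (c >= '0' && c <= '9');
            boolean allowed = alnum
                    || (PCHAR_EXTRA.indexOf(c) >= 0 && alsoEncode.indexOf(c) < 0);
            if (allowed) {
                out.append((char) c);
            } else {
                out.append('%').append(HEX[c >> 4]).append(HEX[c & 0x0F]);
            }
        }
        return out.toString();
    }

    // Joins already-separated segments with "/" as the delimiter, so a "/"
    // that occurs inside a segment is treated as data and escaped.
    static String joinPath(java.util.List<String> segments, String alsoEncode) {
        StringBuilder path = new StringBuilder();
        for (String segment : segments) {
            path.append('/').append(encodeSegment(segment, alsoEncode));
        }
        return path.toString();
    }
}
```

With this sketch, PathSegmentEncoder.joinPath(java.util.Arrays.asList("some", "path;element"), ";") yields /some/path%3Belement, while passing "" as the second argument leaves the semicolon as it is. Which characters belong in alsoEncode depends on how the server interprets them, as the other answer explains.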

answered Oct 10 '22 06:10

McDowell
