We've run into an odd argument where I work, and I may be wrong on this, so this is why I am asking.
Our software outputs a directory to an Apache server that replaces an underscore with a %5F in the name of the directory.
For instance, if the directory name is the string "andy_test" in our software, it becomes "andy%5Ftest" when the software outputs the directory to the Apache server. Unfortunately, when you access the URL on the server, it ends up as "andy%255Ftest".
Somehow this seems wrong to me. Once again, the progression is: "andy_test", then "andy%5Ftest", then "andy%255Ftest".
I'm assuming that "%5F" is the encoding for underscore, and that "%25" is the encoding for "%".
It would seem to me that the directory name on the server should be just plain andy_test, and that only an encoded URI used to access the directory on the Apache server might contain "andy%5Ftest".
I asked the guys on the backend about it, and they said that they were just "encoding anything that was not a letter or a number."
So I guess I'm a bit confused on this. Can you tell me who is right, and direct me to some information on why?
URLs on the World Wide Web can only contain ASCII alphanumeric characters and a few other safe characters: hyphen (-), underscore (_), tilde (~), and dot (.). Any other character must be percent-encoded, unless it is a reserved character serving its special purpose.
Only certain characters are allowed in a URL string: alphabetic characters, numerals, and a few characters, ; , / ? : @ & = + $ - _ . ! ~ * ' ( ) #, some of which can have special meanings.
Underscores aren't permitted in domain names. Google's web crawlers don't like complex URLs that are filled with unnecessary characters. If you aren't careful to encode special characters, the content management system that you're using will encode your file names for you.
You should not encode the directory names as you create them (as you suggested). Encoding should only happen at the last stage where it is handed out to the browser. That's why you are ending up with 'double' encoding: %25 is % and 5F is the leftover from the first encoding of underscore.
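To make the double encoding concrete, here is a rough Python sketch of what seems to be happening. The function `encode_non_alnum` is a hypothetical stand-in for the backend's "encode anything that is not a letter or a number" rule, and `quote` stands in for whatever encodes the link later on; neither is taken from your actual software.

```python
from urllib.parse import quote, unquote


def encode_non_alnum(name: str) -> str:
    """Hypothetical backend rule: percent-encode everything except ASCII letters and digits."""
    out = []
    for ch in name:
        if ch.isascii() and ch.isalnum():
            out.append(ch)
        else:
            # Encode each UTF-8 byte of the character as %XX.
            out.append("".join(f"%{b:02X}" for b in ch.encode("utf-8")))
    return "".join(out)


# First encoding: applied when the directory is created (the mistake).
stored_name = encode_non_alnum("andy_test")

# Second encoding: applied when the link is built. The "%" itself becomes "%25".
link_segment = quote(stored_name)

print(stored_name)   # andy%5Ftest
print(link_segment)  # andy%255Ftest

# One round of decoding only gets back to the already-encoded name.
print(unquote(link_segment))  # andy%5Ftest
```

If the directory were created under its plain name, the second step alone would produce a working link.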
Also, note that you don't need to encode underscores according to RFC 1738:
2.2. URL Character Encoding Issues
...
Thus, only alphanumerics, the special characters "$-_.+!*'(),", and reserved characters used for their reserved purposes may be used unencoded within a URL.
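As a quick check of that rule, the sketch below builds the set of characters the quoted sentence allows unencoded (alphanumerics plus the listed special characters) and tests a few characters against it. The names `RFC1738_UNENCODED` and `needs_encoding` are made up for illustration, and reserved characters are deliberately left out because whether they need encoding depends on how they are used.

```python
import string
from urllib.parse import quote

# Alphanumerics plus the special characters named in the RFC 1738 sentence above.
RFC1738_UNENCODED = set(string.ascii_letters + string.digits + "$-_.+!*'(),")


def needs_encoding(ch: str) -> bool:
    """True if a single character falls outside the always-allowed set."""
    return ch not in RFC1738_UNENCODED


print(needs_encoding("_"))  # False: underscore may appear as-is
print(needs_encoding("%"))  # True: a literal percent sign must be encoded
print(needs_encoding(" "))  # True: so must a space

# Python's standard encoder agrees about the underscore.
print(quote("andy_test"))  # andy_test
```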
There is double encoding happening in what you are showing. Two steps should be enough:
"andy_test" is both the string in the software and the actual name of the directory or script in the filesystem (the resource the web server accesses).
"andy%5Ftest" is andy_test URL-encoded once. This is the string the browser should use (encoding isn't really needed for the underscore, but may be for other characters).
"andy%255Ftest" is just andy_test URL-encoded twice, which makes no sense; there should be no need for it. Just decide WHERE you will do the encoding. If you do it both at the code level and at the web server level, this is what can happen, and the result is broken links unless you also decode twice, which is neither necessary nor sane.
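One way to keep the encoding in a single place might look like the sketch below: the directory keeps its raw name on disk, and the name is encoded exactly once when the link is generated. `BASE_URL` and `directory_url` are hypothetical names, not anything from your software or from Apache.

```python
from urllib.parse import quote, unquote

BASE_URL = "http://example.com"  # hypothetical server address


def directory_url(dirname: str) -> str:
    """Build the link for a directory whose on-disk name is the raw `dirname`."""
    # Encode exactly once, at the point where the URL is handed out.
    return f"{BASE_URL}/{quote(dirname, safe='')}/"


print(directory_url("andy_test"))     # http://example.com/andy_test/
print(directory_url("andy test&co"))  # http://example.com/andy%20test%26co/

# The server decodes once and gets the real directory name back.
print(unquote("andy%20test%26co"))    # andy test&co
```

The underscore passes through untouched, while characters that really do need encoding (the space and the ampersand here) are handled in the same single step.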