How can one avoid illegal characters when composing a URL?

Tags:

base64

I'm writing a web application that dynamically creates URLs based on some input, to be consumed by a client at a later time. For the sake of discussion, these URLs can contain certain characters, like a forward slash ('/'), which should not be interpreted as part of the actual URL, but just as an argument. For example:

http://mycompany.com/PartOfUrl1/PartOfUrl2/ArgumentTo/Url/GoesHere

As you can see, the ArgumentTo/Url/GoesHere does indeed have forward slashes but these should be ignored or escaped.

This may be a bad example, but the question at hand is more general and applies to other special characters.

So, if there are pieces of a URL that are just arguments and should not be used to resolve the actual web request, what's a good way of handling this?

Update:

Given some of the answers I realized that I failed to point out a few pieces that hopefully will help clarify.

I would like to keep this fairly language agnostic, as it would be great if the client could just make a request. For example, if the client knew that it wanted to pass ArgumentTo/Url/GoesHere, it would be great if that could be encoded into a unique string that the server could then decode and use.

Can we assume that similar functions like HttpUtility.HtmlEncode/HtmlDecode in the .NET Framework are available on other systems/platforms? The URL does not have to be pretty by any means so having real words in the path does not really matter.

Would something like a base64 encoding of the argument work?

It seems that base64 encoding/decoding is fairly readily available on any platform/language.
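As a rough sketch of the base64 idea, assuming Python on both ends: plain base64 can itself emit '/', '+' and '=', so the URL-safe alphabet ('-' and '_' instead of '+' and '/') is the variant that fits here. The function names encode_argument and decode_argument are made up for illustration.

```python
import base64


def encode_argument(arg: str) -> str:
    """Encode an arbitrary string into a token that is safe in a URL path."""
    token = base64.urlsafe_b64encode(arg.encode("utf-8")).decode("ascii")
    # Strip the '=' padding so only A-Z, a-z, 0-9, '-' and '_' remain.
    return token.rstrip("=")


def decode_argument(token: str) -> str:
    """Reverse encode_argument, restoring the padding that was stripped."""
    padding = "=" * (-len(token) % 4)
    return base64.urlsafe_b64decode(token + padding).decode("utf-8")


token = encode_argument("ArgumentTo/Url/GoesHere")
url = "http://mycompany.com/PartOfUrl1/PartOfUrl2/" + token
```

The server would then call decode_argument on the last path segment to get the original string back, slashes included.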

Scott Saad asked Oct 27 '08 22:10


2 Answers

You didn't say which language you're using, but PHP has the useful urlencode function, and C# has HttpUtility.UrlEncode and Server.UrlEncode, which should encode parts of your URL nicely.

In case you need another way this page has a list of encoded values. E.g.: / == %2f.
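For comparison, here is a minimal sketch of the same percent-encoding in Python, assuming the standard library's urllib.parse is an acceptable stand-in for urlencode or UrlEncode. The URL is the one from the question.

```python
from urllib.parse import quote, unquote

argument = "ArgumentTo/Url/GoesHere"

# safe="" makes quote() escape '/' as well; by default it leaves '/' alone.
encoded = quote(argument, safe="")
url = "http://mycompany.com/PartOfUrl1/PartOfUrl2/" + encoded

print(encoded)           # ArgumentTo%2FUrl%2FGoesHere
print(unquote(encoded))  # ArgumentTo/Url/GoesHere
```

One caveat: some servers treat %2F in the path specially. Apache, for instance, rejects such requests unless AllowEncodedSlashes is turned on, so it is worth testing against the server you actually deploy on.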

Update

From what you've updated, I'd say use Voyagerfan's idea of URL rewriting to make something like:

http://www.example.com/([A-Za-z0-9/]+) http://www.example.com/?page=$1

And then use the application's GET parser to pull the argument out.
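What the "GET parser" step might look like, sketched in Python with the standard library. The parameter name page comes from the rewrite target above; the rest is illustrative.

```python
from urllib.parse import urlsplit, parse_qs

# The URL as the application would see it after the rewrite.
rewritten = "http://www.example.com/?page=ArgumentTo/Url/GoesHere"

# Split off the query string, then parse it into a dict of lists.
params = parse_qs(urlsplit(rewritten).query)
page = params["page"][0]

print(page)  # ArgumentTo/Url/GoesHere
```

A literal '/' is allowed inside a query string, which is why the argument survives here without any extra escaping.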

Ross answered Sep 23 '22 17:09


You could use Apache rewrites to rewrite http://mycompany.com/PartOfUrl1/PartOfUrl2 to http://mycompany.com/path/to/program.php and then pass in ArgumentTo/Url/GoesHere as a standard GET parameter. So what the server actually sends back is the response for http://mycompany.com/path/to/program.php?arg=ArgumentTo/Url/GoesHere.

Rewriting is a good way to guard against technology changes (so switching from PHP to ASP, for example, won't change your URLs) and provide friendly URLs to your users at the same time.

Update

Using your example URLs and building on what I said before, I'd say to use this code in your httpd.conf or .htaccess:

RewriteEngine On

RewriteRule ^/?PartOfUrl1/PartOfUrl2/([A-Za-z0-9/]+)$ /path/to/program.php?arg=$1 [L]

(BTW, that line needs to contain no line breaks. The pattern is matched against the URL path only, so the scheme and host name do not belong in it.)

Changing the paths, the filenames, name of the arg, etc. is fine; the critical parts here are the regex (([A-Za-z0-9/]+)) and the $1. The + and the / inside the character class matter: without them the group would capture only a single character and stop at the first slash.
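To see what the capture group and $1 do in a rule of this kind, here is a small Python sketch that mimics the matching with the re module. It is only an approximation of mod_rewrite, not a replacement for testing the real rule. The function name rewrite is made up, and the character class used here allows '/' and repeats one or more times so that a multi-segment argument is captured whole.

```python
import re
from typing import Optional

# Hypothetical mirror of a rewrite pattern: one capture group after a fixed prefix.
PATTERN = re.compile(r"^/?PartOfUrl1/PartOfUrl2/([A-Za-z0-9/]+)$")


def rewrite(path: str) -> Optional[str]:
    """Return the rewritten path, or None if the pattern does not match."""
    match = PATTERN.match(path)
    if match is None:
        return None
    # match.group(1) plays the role of $1 in the substitution.
    return "/path/to/program.php?arg=" + match.group(1)


print(rewrite("/PartOfUrl1/PartOfUrl2/ArgumentTo/Url/GoesHere"))
# /path/to/program.php?arg=ArgumentTo/Url/GoesHere
```

Paths that do not start with the expected prefix return None, which corresponds to the rule simply not firing.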

dgw answered Sep 20 '22 17:09
