Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What are the safe characters for making URLs?

To quote section 2.3 of RFC 3986:

Characters that are allowed in a URI, but do not have a reserved purpose, are called unreserved. These include uppercase and lowercase letters, decimal digits, hyphen, period, underscore, and tilde.

  ALPHA  DIGIT  "-" / "." / "_" / "~"

Note that RFC 3986 lists fewer reserved punctuation marks than the older RFC 2396.


There are two sets of characters you need to watch out for: reserved and unsafe.

The reserved characters are:

  • ampersand ("&")
  • dollar ("$")
  • plus sign ("+")
  • comma (",")
  • forward slash ("/")
  • colon (":")
  • semi-colon (";")
  • equals ("=")
  • question mark ("?")
  • 'At' symbol ("@")
  • pound ("#").

The characters generally considered unsafe are:

  • space (" ")
  • less than and greater than ("<>")
  • open and close brackets ("[]")
  • open and close braces ("{}")
  • pipe ("|")
  • backslash ("\")
  • caret ("^")
  • percent ("%")

I may have forgotten one or more, which leads to me echoing Carl V's answer. In the long run you are probably better off using a "white list" of allowed characters and then encoding the string rather than trying to stay abreast of characters that are disallowed by servers and systems.


Always Safe

In theory and by the specification, these are safe basically anywhere, except the domain name. Percent-encode anything not listed, and you're good to go.

    A-Z a-z 0-9 - . _ ~ ( ) ' ! * : @ , ;

Sometimes Safe

Only safe when used within specific URL components; use with care.

    Paths:     + & =
    Queries:   ? /
    Fragments: ? / # + & =
    

Never Safe

According to the URI specification (RFC 3986), all other characters must be percent-encoded. This includes:

    <space> <control-characters> <extended-ascii> <unicode>
    % < > [ ] { } | \ ^
    

If maximum compatibility is a concern, limit the character set to A-Z a-z 0-9 - _ . (with periods only for filename extensions).

Keep Context in Mind

Even if valid per the specification, a URL can still be "unsafe", depending on context. Such as a file:/// URL containing invalid filename characters, or a query component containing "?", "=", and "&" when not used as delimiters. Correct handling of these cases are generally up to your scripts and can be worked around, but it's something to keep in mind.


You are best keeping only some characters (whitelist) instead of removing certain characters (blacklist).

You can technically allow any character, just as long as you properly encode it. But, to answer in the spirit of the question, you should only allow these characters:

  1. Lower case letters (convert upper case to lower)
  2. Numbers, 0 through 9
  3. A dash - or underscore _
  4. Tilde ~

Everything else has a potentially special meaning. For example, you may think you can use +, but it can be replaced with a space. & is dangerous, too, especially if using some rewrite rules.

As with the other comments, check out the standards and specifications for complete details.


Looking at RFC3986 - Uniform Resource Identifier (URI): Generic Syntax, your question revolves around the path component of a URI.

    foo://example.com:8042/over/there?name=ferret#nose
     \_/   \______________/\_________/ \_________/ \__/
      |           |            |            |        |
   scheme     authority       path        query   fragment
      |   _____________________|__
     / \ /                        \
     urn:example:animal:ferret:nose

Citing section 3.3, valid characters for a URI segment are of type pchar:

pchar = unreserved / pct-encoded / sub-delims / ":" / "@"

Which breaks down to:

ALPHA / DIGIT / "-" / "." / "_" / "~"

pct-encoded

"!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="

":" / "@"

Or in other words: You may use any (non-control-) character from the ASCII table, except /, ?, #, [ and ].

This understanding is backed by RFC1738 - Uniform Resource Locators (URL).