To quote section 2.3 of RFC 3986:
Characters that are allowed in a URI, but do not have a reserved purpose, are called unreserved. These include uppercase and lowercase letters, decimal digits, hyphen, period, underscore, and tilde.
ALPHA DIGIT "-" / "." / "_" / "~"
Note that RFC 3986 lists fewer reserved punctuation marks than the older RFC 2396.
There are two sets of characters you need to watch out for: reserved and unsafe.
The reserved characters are:
The characters generally considered unsafe are:
I may have forgotten one or more, which leads to me echoing Carl V's answer. In the long run you are probably better off using a "white list" of allowed characters and then encoding the string rather than trying to stay abreast of characters that are disallowed by servers and systems.
In theory and by the specification, these are safe basically anywhere, except the domain name. Percent-encode anything not listed, and you're good to go.
A-Z a-z 0-9 - . _ ~ ( ) ' ! * : @ , ;
Only safe when used within specific URL components; use with care.
Paths: + & =
Queries: ? /
Fragments: ? / # + & =
According to the URI specification (RFC 3986), all other characters must be percent-encoded. This includes:
<space> <control-characters> <extended-ascii> <unicode>
% < > [ ] { } | \ ^
If maximum compatibility is a concern, limit the character set to A-Z a-z 0-9 - _ . (with periods only for filename extensions).
Even if valid per the specification, a URL can still be "unsafe", depending on context. Such as a file:/// URL containing invalid filename characters, or a query component containing "?", "=", and "&" when not used as delimiters. Correct handling of these cases are generally up to your scripts and can be worked around, but it's something to keep in mind.
You are best keeping only some characters (whitelist) instead of removing certain characters (blacklist).
You can technically allow any character, just as long as you properly encode it. But, to answer in the spirit of the question, you should only allow these characters:
Everything else has a potentially special meaning. For example, you may think you can use +, but it can be replaced with a space. & is dangerous, too, especially if using some rewrite rules.
As with the other comments, check out the standards and specifications for complete details.
Looking at RFC3986 - Uniform Resource Identifier (URI): Generic Syntax, your question revolves around the path component of a URI.
foo://example.com:8042/over/there?name=ferret#nose
\_/ \______________/\_________/ \_________/ \__/
| | | | |
scheme authority path query fragment
| _____________________|__
/ \ / \
urn:example:animal:ferret:nose
Citing section 3.3, valid characters for a URI segment
are of type pchar
:
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
Which breaks down to:
ALPHA / DIGIT / "-" / "." / "_" / "~"
pct-encoded
"!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
":" / "@"
Or in other words: You may use any (non-control-) character from the ASCII table, except /
, ?
, #
, [
and ]
.
This understanding is backed by RFC1738 - Uniform Resource Locators (URL).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With