
What characters are valid in a URL? [duplicate]

Tags: html, url

I'm trying to strip the non-URL parts out of a large string. Most of the regexes I've found use a character class like [A-Za-z0-9-_.!~*'()], but a URL can contain more than that, for example http://127.0.0.1:8080/test?v=123#this.

So what is the current set of characters that are valid in a URL?

blez asked Aug 18 '11 14:08




1 Answer

All the gory details can be found in the current RFC on the topic: RFC 3986 (Uniform Resource Identifier (URI): Generic Syntax)

Based on a related answer, the characters that may appear in a URI are: A-Z, a-z, 0-9, -, ., _, ~, :, /, ?, #, [, ], @, !, $, &, ', (, ), *, +, ,, ;, %, and =. Everything else must be percent-encoded (URL-encoded). Some of these characters are only allowed in specific parts of a URI and must be percent-encoded anywhere else; for example, % may only appear as the start of a percent-encoded sequence such as %20. The RFC has all of these specifics.
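As a rough illustration of how that character list could be used for the question's task, here is a minimal Python sketch. The names URI_CHARS, URL_CANDIDATE, find_url_candidates and percent_encode are made up for this example, and the pattern assumes the URLs of interest start with http:// or https://. It only checks which characters appear, not where they appear, so it does not validate a URI against the RFC 3986 grammar.

```python
import re
from urllib.parse import quote

# Characters RFC 3986 allows somewhere in a URI: unreserved characters,
# gen-delims, sub-delims, plus "%" for percent-encoded octets.
URI_CHARS = r"A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%"

# Assumption for this sketch: a URL starts with http:// or https:// and
# continues for as long as the characters stay inside the allowed set.
URL_CANDIDATE = re.compile(r"https?://[" + URI_CHARS + r"]+")


def find_url_candidates(text):
    """Return every substring of text that looks like an http(s) URL."""
    return URL_CANDIDATE.findall(text)


def percent_encode(segment):
    """Percent-encode everything in segment except unreserved characters."""
    return quote(segment, safe="")


if __name__ == "__main__":
    sample = "see <http://127.0.0.1:8080/test?v=123#this> for example"
    print(find_url_candidates(sample))
    print(percent_encode("a b{c}"))
```

Because characters such as , and ) are themselves valid in a URI, a match can swallow trailing punctuation from the surrounding sentence; trimming that is a separate heuristic and is not covered here.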

ckittel answered Oct 16 '22 21:10
