I'm trying to remove the non-URL part of a big string. Most of the regexes I found are like [A-Za-z0-9-_.!~*'()]
, but there are more things that can a url contain. Like http://127.0.0.1:8080/test?v=123#this
for example
So what are the latest characters for a valid URL?
A URL is composed from a limited set of characters belonging to the US-ASCII character set. These characters include digits (0-9), letters(A-Z, a-z), and a few special characters ( "-" , "." , "_" , "~" ).
These characters are "{", "}", "|", "\", "^", "~", "[", "]", and "`". All unsafe characters must always be encoded within a URL.
All the gory details can be found in the current RFC on the topic: RFC 3986 (Uniform Resource Identifier (URI): Generic Syntax)
Based on this related answer, you are looking at a list that looks like: A-Z
, a-z
, 0-9
, -
, .
, _
, ~
, :
, /
, ?
, #
, [
, ]
, @
, !
, $
, &
, '
, (
, )
, *
, +
, ,
, ;
, %
, and =
. Everything else must be url-encoded. Also, some of these characters can only exist in very specific spots in a URI and outside of those spots must be url-encoded (e.g. %
can only be used in conjunction with url encoding as in %20
), the RFC has all of these specifics.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With