Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Getting parts of a URL (Regex)

A single regex to parse and breakup a full URL including query parameters and anchors e.g.

https://www.google.com/dir/1/2/search.html?arg=0-a&arg1=1-b&arg3-c#hash

^((http[s]?|ftp):\/)?\/?([^:\/\s]+)((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(.*)?(#[\w\-]+)?$

RexEx positions:

url: RegExp['$&'],

protocol:RegExp.$2,

host:RegExp.$3,

path:RegExp.$4,

file:RegExp.$6,

query:RegExp.$7,

hash:RegExp.$8

you could then further parse the host ('.' delimited) quite easily.

What I would do is use something like this:

/*
    ^(.*:)//([A-Za-z0-9\-\.]+)(:[0-9]+)?(.*)$
*/
proto $1
host $2
port $3
the-rest $4

the further parse 'the rest' to be as specific as possible. Doing it in one regex is, well, a bit crazy.


I realize I'm late to the party, but there is a simple way to let the browser parse a url for you without a regex:

var a = document.createElement('a');
a.href = 'http://www.example.com:123/foo/bar.html?fox=trot#foo';

['href','protocol','host','hostname','port','pathname','search','hash'].forEach(function(k) {
    console.log(k+':', a[k]);
});

/*//Output:
href: http://www.example.com:123/foo/bar.html?fox=trot#foo
protocol: http:
host: www.example.com:123
hostname: www.example.com
port: 123
pathname: /foo/bar.html
search: ?fox=trot
hash: #foo
*/

I'm a few years late to the party, but I'm surprised no one has mentioned the Uniform Resource Identifier specification has a section on parsing URIs with a regular expression. The regular expression, written by Berners-Lee, et al., is:

^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
 12            3  4          5       6  7        8 9

The numbers in the second line above are only to assist readability; they indicate the reference points for each subexpression (i.e., each paired parenthesis). We refer to the value matched for subexpression as $. For example, matching the above expression to

http://www.ics.uci.edu/pub/ietf/uri/#Related

results in the following subexpression matches:

$1 = http:
$2 = http
$3 = //www.ics.uci.edu
$4 = www.ics.uci.edu
$5 = /pub/ietf/uri/
$6 = <undefined>
$7 = <undefined>
$8 = #Related
$9 = Related

For what it's worth, I found that I had to escape the forward slashes in JavaScript:

^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?


I found the highest voted answer (hometoast's answer) doesn't work perfectly for me. Two problems:

  1. It can not handle port number.
  2. The hash part is broken.

The following is a modified version:

^((http[s]?|ftp):\/)?\/?([^:\/\s]+)(:([^\/]*))?((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(\?([^#]*))?(#(.*))?$

Position of parts are as follows:

int SCHEMA = 2, DOMAIN = 3, PORT = 5, PATH = 6, FILE = 8, QUERYSTRING = 9, HASH = 12

Edit posted by anon user:

function getFileName(path) {
    return path.match(/^((http[s]?|ftp):\/)?\/?([^:\/\s]+)(:([^\/]*))?((\/[\w\/-]+)*\/)([\w\-\.]+[^#?\s]+)(\?([^#]*))?(#(.*))?$/i)[8];
}

I needed a regular Expression to match all urls and made this one:

/(?:([^\:]*)\:\/\/)?(?:([^\:\@]*)(?:\:([^\@]*))?\@)?(?:([^\/\:]*)\.(?=[^\.\/\:]*\.[^\.\/\:]*))?([^\.\/\:]*)(?:\.([^\/\.\:]*))?(?:\:([0-9]*))?(\/[^\?#]*(?=.*?\/)\/)?([^\?#]*)?(?:\?([^#]*))?(?:#(.*))?/

It matches all urls, any protocol, even urls like

ftp://user:[email protected]:8080/dir1/dir2/file.php?param1=value1#hashtag

The result (in JavaScript) looks like this:

["ftp", "user", "pass", "www.cs", "server", "com", "8080", "/dir1/dir2/", "file.php", "param1=value1", "hashtag"]

An url like

mailto://[email protected]

looks like this:

["mailto", "admin", undefined, "www.cs", "server", "com", undefined, undefined, undefined, undefined, undefined] 

I was trying to solve this in javascript, which should be handled by:

var url = new URL('http://a:[email protected]:890/path/wah@t/foo.js?foo=bar&bingobang=&[email protected]#foobar/bing/bo@ng?bang');

since (in Chrome, at least) it parses to:

{
  "hash": "#foobar/bing/bo@ng?bang",
  "search": "?foo=bar&bingobang=&[email protected]",
  "pathname": "/path/wah@t/foo.js",
  "port": "890",
  "hostname": "example.com",
  "host": "example.com:890",
  "password": "b",
  "username": "a",
  "protocol": "http:",
  "origin": "http://example.com:890",
  "href": "http://a:[email protected]:890/path/wah@t/foo.js?foo=bar&bingobang=&[email protected]#foobar/bing/bo@ng?bang"
}

However, this isn't cross browser (https://developer.mozilla.org/en-US/docs/Web/API/URL), so I cobbled this together to pull the same parts out as above:

^(?:(?:(([^:\/#\?]+:)?(?:(?:\/\/)(?:(?:(?:([^:@\/#\?]+)(?:\:([^:@\/#\?]*))?)@)?(([^:\/#\?\]\[]+|\[[^\/\]@#?]+\])(?:\:([0-9]+))?))?)?)?((?:\/?(?:[^\/\?#]+\/+)*)(?:[^\?#]*)))?(\?[^#]+)?)(#.*)?

Credit for this regex goes to https://gist.github.com/rpflorence who posted this jsperf http://jsperf.com/url-parsing (originally found here: https://gist.github.com/jlong/2428561#comment-310066) who came up with the regex this was originally based on.

The parts are in this order:

var keys = [
    "href",                    // http://user:[email protected]:81/directory/file.ext?query=1#anchor
    "origin",                  // http://user:[email protected]:81
    "protocol",                // http:
    "username",                // user
    "password",                // pass
    "host",                    // host.com:81
    "hostname",                // host.com
    "port",                    // 81
    "pathname",                // /directory/file.ext
    "search",                  // ?query=1
    "hash"                     // #anchor
];

There is also a small library which wraps it and provides query params:

https://github.com/sadams/lite-url (also available on bower)

If you have an improvement, please create a pull request with more tests and I will accept and merge with thanks.