This should be simple, but it's eluding me. There are many good and bad regex methods to match a URL, with or without the protocol, with or without www. The problem I have is this (in JavaScript): if I use a regex to match URLs in a text string, and set it so that it will match just 'domain.com', it also catches the domain of an e-mail address (the part after '@'), which I don't want. A negative lookbehind solves it - but obviously not in JS.
This is my nearest success so far:
/^(www\.)?([^@])([a-z]*\.)(com|net|edu|org)(\.au)?(\/\S*)?$/g
but that fails if the match is not at the start of the string. And I'm sure I'm tackling it the wrong way. Is there a simple answer out there anywhere?
EDIT: Revised regex to respond to a few of the comments below (sticks with 'www' rather than allowing sub-domains):
\b(www\.)?([^@])(\w*\.)(\w{2,3})(\.\w{2,3})?(\/\S*)?$
As mentioned in the comments however, this still matches the domain after a @.
Thanks
"that fails if the match is not at the start of the string"
That's because of the ^ at the beginning of the pattern. Without it:
/(www\.)?([^@])([a-z]*\.)(com|net|edu|org)(\.au)?(\/\S*)?$/g
js> "www.foobar.com".match(/(www\.)?([^@])([a-z]*\.)(com|net|edu|org)(\.au)?(\/\S*)?$/g)
["www.foobar.com"]
js> "aoeuaoeu foobar.com".match(/(www\.)?([^@])([a-z]*\.)(com|net|edu|org)(\.au)?(\/\S*)?$/g)
[" foobar.com"]
js> "toto@aoeuaoeu foobar.com".match(/(www\.)?([^@])([a-z]*\.)(com|net|edu|org)(\.au)?(\/\S*)?$/g)
[" foobar.com"]
js> "toto@aoeuaoeu [email protected]".match(/(www\.)?([^@])([a-z]*\.)(com|net|edu|org)(\.au)?(\/\S*)?$/g)
["foobar.com"]
Though it's still matching a space before the domain, because ([^@]) always consumes one character. And it's making wrong assumptions about the domain…
xyz.example.org is a valid domain not matched by your regexp;
www.3x4mpl3.org is a valid domain not matched by your regexp;
example.co.uk is a valid domain not matched by your regexp;
ουτοπία.δπθ.gr is a valid domain not matched by your regexp.
What defines a legal domain name? Roughly, it's a sequence of labels separated by dots, and with internationalized names the labels can contain non-ASCII characters. It can't have two dots following each other, and the minimal form is \w\.\w\w (as I don't think a one-letter TLD exists).
Though, the way I'd do it is to simply match everything that looks like a domain, by taking everything that is text with a dot separator using word boundaries (\b):
/\b(\w+\.)+\w+\b/g
js> "aoe toto.example.org uaoeu foo.bar aoeuaoeu".match(/\b(\w+\.)+\w+\b/g)
["toto.example.org", "foo.bar"]
js> "aoe [email protected] toto.example.org uaoeu foo.bar aoeuaoeu".match(/\b(\w+\.)+\w+\b/g)
["example.org", "toto.example.org", "foo.bar"]
js> "aoe [email protected] toto.example.org uaoeu foo.bar aoeuaoeu f00bar.com".match(/\b(\w+\.)+\w+\b/g)
["example.org", "toto.example.org", "foo.bar", "f00bar.com"]
and then do a second pass over the list of domains found to check whether each one really exists. The downside is that regexps in JavaScript treat \w and \b as ASCII-only, so neither will accept ουτοπία.δπθ.gr as a valid domain name.
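To illustrate the second round: a minimal sketch, assuming the check is simply "does the last label appear in a list of known TLDs". KNOWN_TLDS, findCandidates and keepKnownTlds are made-up names, and the list is deliberately tiny; a real check would need a complete, up-to-date list.

```javascript
// Hypothetical, deliberately short TLD list, used for illustration only.
const KNOWN_TLDS = new Set(["com", "net", "org", "edu", "uk", "au", "gr"]);

// First pass: grab everything that looks like dot-separated words.
function findCandidates(text) {
  return text.match(/\b(\w+\.)+\w+\b/g) || [];
}

// Second pass: keep only candidates whose last label is a known TLD.
function keepKnownTlds(candidates) {
  return candidates.filter(function (candidate) {
    const tld = candidate.slice(candidate.lastIndexOf(".") + 1).toLowerCase();
    return KNOWN_TLDS.has(tld);
  });
}

console.log(keepKnownTlds(findCandidates("aoe toto.example.org uaoeu foo.bar $11.00 f00bar.com")));
// → ["toto.example.org", "f00bar.com"]
```

The first pass also picks up foo.bar and 11.00; the second pass is what throws them away.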
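In ES6 there's the /u flag, but on its own it doesn't make \w or \b Unicode-aware. Combined with Unicode property escapes (ES2018, so only in recent engines), you can spell out the character class yourself and drop \b, which is defined in terms of \w:
"ουτοπία.δπθ.gr aoe [email protected] toto.example.org uaoeu foo.bar aoeuaoeu".match(/(?:[\p{L}\p{N}_]+\.)+[\p{L}\p{N}_]+/gu)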
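A quick way to check the Unicode behaviour in an engine that supports Unicode property escapes (ES2018): as far as I know, the u flag on its own leaves \w ASCII-only, so this sketch spells the class out with \p{L}, \p{M} and \p{N} instead. findUnicodeDomains is a hypothetical helper name, and this pass makes no attempt to skip e-mail addresses.

```javascript
// A "word" character here is any Unicode letter, combining mark, digit, or "_".
// No \b is needed: the greedy classes already stop at the first non-word character.
const UNICODE_DOMAIN = /(?:[\p{L}\p{M}\p{N}_]+\.)+[\p{L}\p{M}\p{N}_]+/gu;

function findUnicodeDomains(text) {
  return text.match(UNICODE_DOMAIN) || [];
}

// \w stays ASCII-only even with the u flag:
console.log(/\w/u.test("δ")); // → false

console.log(findUnicodeDomains("ουτοπία.δπθ.gr aoe toto.example.org uaoeu foo.bar"));
// → ["ουτοπία.δπθ.gr", "toto.example.org", "foo.bar"]
```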
edit:
"A negative lookbehind solves it - but obviously not in JS."
Yes it will: for skipping all e-mail addresses, here's a lookbehind version of the regex:
/(?<!@)\b(\w+\.)+\w+\b/g
js> "aoe [email protected] toto.example.org uaoeu foo.bar aoeuaoeu f00bar.com".match(/(?<!@)\b(\w+\.)+\w+\b/g)
["toto.example.org", "foo.bar", "f00bar.com"]
Though, as with the Unicode support, lookbehind is a later addition to the language (ES2018), so it isn't available in older JS engines.
There, the only way around it is to keep the @ in the match and then discard any match that contains an @:
js> "toto.net aoe [email protected] toto.example.org uaoeu foo.bar aoeuaoeu f00bar.com".match(/@?\b(\w+\.)+\w+\b/g).filter(function (x) { return !x.match(/@/) })
["toto.net", "toto.example.org", "foo.bar", "f00bar.com"]
or, in ES6, write the filter more compactly with an arrow function (array comprehensions were only ever a non-standard Firefox extension and were dropped from the ES6 draft):
"toto.net aoe [email protected] toto.example.org uaoeu foo.bar aoeuaoeu f00bar.com".match(/@?\b(\w+\.)+\w+\b/g).filter(x => !x.includes("@"));
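The two approaches above (lookbehind where the engine has it, filtering on @ where it doesn't) can be combined. This is only a sketch: makeLookbehindRegex, findDomainsByFiltering and findDomains are made-up names, and the lookbehind here is a slightly wider variant of my own, (?<!@[\w.-]*), so that sub-domains of an e-mail host are skipped too.

```javascript
// Returns a lookbehind-based regex, or null if the engine rejects the syntax.
// Built from a string so the failure is a catchable error, not a parse error.
function makeLookbehindRegex() {
  try {
    return new RegExp("(?<!@[\\w.-]*)\\b(\\w+\\.)+\\w+\\b", "g");
  } catch (e) {
    return null;
  }
}

// Fallback: let the match swallow a leading "@", then drop those matches.
function findDomainsByFiltering(text) {
  const matches = text.match(/@?\b(\w+\.)+\w+\b/g) || [];
  return matches.filter(function (m) { return m.charAt(0) !== "@"; });
}

function findDomains(text) {
  const re = makeLookbehindRegex();
  return re ? (text.match(re) || []) : findDomainsByFiltering(text);
}

// user@example.org is an assumed sample address.
console.log(findDomains("aoe user@example.org toto.example.org uaoeu foo.bar f00bar.com"));
// → ["toto.example.org", "foo.bar", "f00bar.com"]
```

Neither branch handles a dotted local part such as first.last@example.org, where first.last would still be reported, so treat the result as a list of candidates.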
one final update:
/@?\b(\w*[^\W\d]+\w*\.+)+[^\W\d_]{2,}\b/g
> "x.y tot.toc.toc $11.00 11.com 11foo.com toto.11 toto.net aoe [email protected] toto.example.org uaoeu foo.bar aoeuaoeu f00bar.com".match(/@?\b(\w*[^\W\d]+\w*\.+)+[^\W\d_]{2,}\b/g).filter(function (x) { if (!x.match(/@/)) return x })
[ 'tot.toc.toc',
'11foo.com',
'toto.net',
'toto.example.org',
'foo.bar',
'f00bar.com' ]
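For reuse, that final pattern can be wrapped in a small function. A sketch only: extractDomains and DOMAIN_OR_EMAIL_HOST are hypothetical names, user@example.org is an assumed sample address, and the comments are my reading of what each part of the regex does.

```javascript
// The pattern from the final update, piece by piece:
//   @?                     optionally swallow a leading "@" so e-mail hosts can be dropped later
//   (\w*[^\W\d]+\w*\.+)+   one or more labels, each with at least one letter or underscore
//   [^\W\d_]{2,}           a last label made of two or more ASCII letters
const DOMAIN_OR_EMAIL_HOST = /@?\b(\w*[^\W\d]+\w*\.+)+[^\W\d_]{2,}\b/g;

function extractDomains(text) {
  const matches = text.match(DOMAIN_OR_EMAIL_HOST) || [];
  // Anything that kept its "@" was the host part of an e-mail address.
  return matches.filter(function (m) { return m.indexOf("@") === -1; });
}

console.log(extractDomains("x.y $11.00 11.com 11foo.com toto.11 user@example.org toto.example.org"));
// → ["11foo.com", "toto.example.org"]
```

The `|| []` guard matters because match returns null, not an empty array, when nothing is found.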