Google+ seems to use The-King-of-URL-Regexes to parse the suckers out of user posts. It doesn't require protocols and is good about ignoring punctuation. For example, if I post "I like plus.google.com.", the site turns "plus.google.com" into a link and leaves the trailing period out of it. So if anyone knows of a regex that can parse URLs both with and without protocols and is good at ignoring punctuation, please answer with it.
I don't think this question is a dupe, because all the answers I've seen to similar questions seem to require a protocol in the URL.
Thanks
Here's a more complete (full URL) implementation. Note that it is not fully RFC 3986 compliant: it is missing some TLDs, allows some illegal country-code TLDs, allows dropping the protocol part (as requested in the original question), and has some other imperfections. The upside is that it is simple, much shorter than many other implementations, and does >95% of the job.
#!/usr/bin/perl -w
# URL grammar, not 100% RFC 3986 but pretty good considering the simplicity.
# For more complete implementation options see:
# http://mathiasbynens.be/demo/url-regex
# https://gist.github.com/dperini/729294
# https://github.com/garycourt/uri-js (RFC 3986 compliant)
#
my $Protocol = '(?:https?|ftp)://';
# Add more new TLDs for completeness
my $TLD = '(?:com|net|info|org|gov|edu|[a-z]{2})';
my $UserAuth = '(?:[^\s:@]+:[^\s@]*@)';
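# Greedy label match plus \b, so the TLD cannot end mid-word
# (otherwise [a-z]{2} matches the "go" in "plus.google.com").
my $HostName = '(?:(?:[-\w]+\.)+' . ${TLD} . '\b)';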
my $Port = '(?::\d+)';
my $Pathname = '/[^\s?#&]*';
# The value after "=" is optional and may be any length
my $Arg = '\w+(?:=[^\s&]*)?';
my $ArgList = "${Arg}(?:\&${Arg})*";
my $QueryArgs = '\?' . ${ArgList};
my $URL = qr/
(?:${Protocol})? # Optional, not per RFC!
${UserAuth}?
${HostName}
${Port}?
(?:${Pathname})?
(?:${QueryArgs})?
/sox;
while (<>) {
    while (/($URL)/g) {
        # $1 is the captured URL; avoids the $& penalty on older Perls
        print "found URL: $1\n";
    }
}
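To see how the grammar copes with the question's own example, here is a minimal self-contained sketch. It repeats the pattern so it runs on its own, with a few assumptions of mine that are not necessarily what the answer intended: host labels are matched greedily with a `\b` after the TLD, `$Arg` accepts multi-character values, and a hypothetical helper `extract_urls` trims trailing sentence punctuation that the path or query part would otherwise swallow. The trimming is a heuristic and will also strip a trailing "?" or "." that was really part of the URL.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same grammar as the answer, repeated so this sketch runs standalone.
my $Protocol  = '(?:https?|ftp)://';
my $TLD       = '(?:com|net|info|org|gov|edu|[a-z]{2})';
my $UserAuth  = '(?:[^\s:@]+:[^\s@]*@)';
# Assumption: greedy labels and \b so the TLD cannot end mid-word.
my $HostName  = '(?:(?:[-\w]+\.)+' . $TLD . '\b)';
my $Port      = '(?::\d+)';
my $Pathname  = '/[^\s?#&]*';
# Assumption: "=value" is optional and the value may be any length.
my $Arg       = '\w+(?:=[^\s&]*)?';
my $ArgList   = "${Arg}(?:&${Arg})*";
my $QueryArgs = '\?' . $ArgList;
my $URL = qr/
    (?:$Protocol)?   # Optional, not per RFC!
    $UserAuth?
    $HostName
    $Port?
    (?:$Pathname)?
    (?:$QueryArgs)?
/x;

# extract_urls is a hypothetical helper, not part of the original answer.
# It returns every URL-like match in $text, minus trailing punctuation.
sub extract_urls {
    my ($text) = @_;
    my @urls;
    while ($text =~ /($URL)/g) {
        my $url = $1;
        $url =~ s/[.,;:!?]+$//;   # heuristic: drop sentence punctuation
        push @urls, $url;
    }
    return @urls;
}

print "$_\n"
    for extract_urls('I like plus.google.com. See http://example.com/foo?a=12&b=3, ok?');
```

Run as is, this should print plus.google.com and http://example.com/foo?a=12&b=3 on separate lines, and a string such as "plus.google.company" should produce no match because of the word boundary after the TLD.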