How does Google+ parse URLs from posts?

Google+ seems to use The-King-of-URL-Regexes to parse URLs out of user posts. It doesn't require a protocol and is good at ignoring surrounding punctuation. For example, if I post "I like plus.google.com.", the site links only plus.google.com and leaves the trailing period as plain text. So if anyone knows of a regex that can parse URLs both with and without protocols and handles surrounding punctuation well, please answer with it.

I don't think this question is a dupe, because all the answers I've seen to similar questions seem to require a protocol in the URL.

Thanks

JoshNaro asked Oct 21 '22 21:10


1 Answer

Here's a more complete (full URL) implementation. Note that it is not fully RFC 3986 compliant: it is missing some TLDs, allows some illegal country TLDs, allows dropping the protocol part (as requested in the original question), and has some other imperfections. The upside is that it is simple, much shorter than many other implementations, and does >95% of the job.

#!/usr/bin/perl -w
# URL grammar, not 100% RFC 3986 but pretty good considering the simplicity.
# For more complete implementation options see:
#   http://mathiasbynens.be/demo/url-regex
#   https://gist.github.com/dperini/729294
#   https://github.com/garycourt/uri-js (RFC 3986 compliant)
#
my $Protocol = '(?:https?|ftp)://';
# Add more new TLDs for completeness
my $TLD = '(?:com|net|info|org|gov|edu|[a-z]{2})';
my $UserAuth = '(?:[^\s:@]+:[^\s@]*@)';
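# \b after the TLD keeps the lazy match from stopping inside a label
# (e.g. taking "go" from "google" as a two-letter TLD).
my $HostName = '(?:(?:[-\w]+\.)+?' . ${TLD} . '\b)';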
my $Port = '(?::\d+)';
my $Pathname = '/[^\s?#&]*';
# A key, optionally followed by =value (the value may be empty).
my $Arg = '\w+(?:=[^\s&]*)?';
my $ArgList = "${Arg}(?:\&${Arg})*";
my $QueryArgs = '\?' . ${ArgList};
my $URL = qr/
    (?:${Protocol})?    # Optional, not per RFC!
    ${UserAuth}?
    ${HostName}
    ${Port}?
    (?:${Pathname})?
    (?:${QueryArgs})?
/sox;

while (<>) {
    while (/($URL)/g) {
        print "found URL: $1\n";
    }
}
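
The question also asks about ignoring punctuation next to a URL, which the grammar above does not address directly (a path may still swallow a trailing comma or period). One possible approach, sketched below, is to match loosely and then trim sentence punctuation from the end of each candidate. The `extract_urls` helper and its simplified candidate pattern are assumptions made for illustration; they are not part of the grammar above and are not how Google+ actually does it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: find URL-like candidates in a string, then strip
# trailing sentence punctuation from each one.
sub extract_urls {
    my ($text) = @_;
    my @urls;
    # Deliberately loose candidate pattern: optional protocol, dotted host
    # ending in two or more letters, optional path/query of non-space chars.
    while ($text =~ m{((?:(?:https?|ftp)://)?(?:[-\w]+\.)+[a-z]{2,}(?:/\S*)?)}gi) {
        my $candidate = $1;
        # Drop punctuation that usually belongs to the sentence, not the URL.
        $candidate =~ s/[.,;:!?)\]'"]+$//;
        push @urls, $candidate;
    }
    return @urls;
}

print "found URL: $_\n"
    for extract_urls('I like plus.google.com. See http://example.com/a?b=1, too.');
```

This trade-off is a heuristic: a URL that legitimately ends in one of the trimmed characters, such as a closing parenthesis, will be shortened as well.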
arielf answered Oct 31 '22 21:10