How does Google+ parse URLs from posts?

Question

Google+ seems to use The-King-of-URL-Regexes to parse the suckers out of user posts. It doesn't require protocols and is good about ignoring punctuation. For example, if I post "I like plus.google.com.", the site turns "plus.google.com" into a link and leaves the trailing period out of it. So if anyone knows of a regex that can parse URLs both with and without protocols and is good at ignoring punctuation, please answer with it.

I don't think this question is a dupe, because all the answers I've seen to similar questions seem to require a protocol in the URL.

Thanks

arielf · Accepted Answer

Here's a more complete (full URL) implementation. Note that it is not fully RFC 3986 compliant: it is missing some TLDs, allows some illegal country TLDs, allows dropping the protocol part (as requested in the original question), and has some other imperfections. The upside is that it is simple, much shorter than many other implementations, and does >95% of the job.

#!/usr/bin/perl -w
use strict;

# URL grammar, not 100% RFC 3986 but pretty good considering the simplicity.
# For more complete implementation options see:
#   http://mathiasbynens.be/demo/url-regex
#   https://gist.github.com/dperini/729294
#   https://github.com/garycourt/uri-js (RFC 3986 compliant)
#
my $Protocol = '(?:https?|ftp)://';
# Add more new TLDs for completeness
my $TLD = '(?:com|net|info|org|gov|edu|[a-z]{2})';
my $UserAuth = '(?:[^\s:@]+:[^\s@]*@)';
# The \b stops the two-letter TLD branch from matching inside a longer
# label (otherwise "plus.google.com" would match as "plus.go").
my $HostName = '(?:(?:[-\w]+\.)+?' . ${TLD} . '\b)';
my $Port = '(?::\d+)';
my $Pathname = '/[^\s?#&]*';
# One query argument: a name, optionally followed by "=value"
my $Arg = '\w+(?:=[^\s&]*)?';
my $ArgList = "${Arg}(?:\&${Arg})*";
my $QueryArgs = '\?' . ${ArgList};
my $URL = qr/
    (?:${Protocol})?    # Optional, not per RFC!
    ${UserAuth}?
    ${HostName}
    ${Port}?
    (?:${Pathname})?
    (?:${QueryArgs})?
/sox;

# Read lines from STDIN or from files named on the command line
while (<>) {
    while (/($URL)/g) {
        print "found URL: $1\n";
    }
}
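
The question is ultimately about turning matched URLs into links, so here is a minimal sketch of that step built on the same grammar. The helper names `escape_html` and `linkify`, the choice of `http://` as the default scheme for protocol-less URLs, and the set of trailing punctuation characters that get stripped are all assumptions made for illustration; nothing here is known to be what Google+ actually does. The `\b` after the TLD list keeps a bare host such as "plus.google.com" from matching only as "plus.go".

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same grammar as the answer, written without /x for brevity.
my $Protocol  = '(?:https?|ftp)://';
my $TLD       = '(?:com|net|info|org|gov|edu|[a-z]{2})';
my $UserAuth  = '(?:[^\s:@]+:[^\s@]*@)';
my $HostName  = '(?:(?:[-\w]+\.)+?' . $TLD . '\b)';
my $Port      = '(?::\d+)';
my $Pathname  = '/[^\s?#&]*';
my $Arg       = '\w+(?:=[^\s&]*)?';
my $QueryArgs = '\?' . $Arg . '(?:&' . $Arg . ')*';
my $URL = qr/((?:${Protocol})?${UserAuth}?${HostName}${Port}?(?:${Pathname})?(?:${QueryArgs})?)/;

# Hypothetical helper: escape the characters that are special in HTML.
sub escape_html {
    my ($text) = @_;
    my %entity = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;');
    $text =~ s/([&<>"])/$entity{$1}/g;
    return $text;
}

# Hypothetical helper: wrap every URL found in $text in an anchor tag.
sub linkify {
    my ($text) = @_;
    my ($out, $last) = ('', 0);
    while ($text =~ /$URL/g) {
        # Save the match data before running any other regex.
        my ($start, $url) = ($-[1], $1);
        # Hand trailing sentence punctuation back to the surrounding text.
        $url =~ s/[.,;:!?)]+\z//;
        # Assumption: protocol-less URLs are linked with http://.
        my $href = $url =~ m{\A(?:https?|ftp)://} ? $url : "http://$url";
        $out .= escape_html(substr($text, $last, $start - $last));
        $out .= sprintf '<a href="%s">%s</a>', escape_html($href), escape_html($url);
        $last = $start + length $url;
    }
    return $out . escape_html(substr($text, $last));
}

print linkify("I like plus.google.com."), "\n";
# prints: I like <a href="http://plus.google.com">plus.google.com</a>.
```

Stripping punctuation after the match, rather than inside the regex, keeps the grammar itself unchanged; the cost is that a URL which genuinely ends in one of those characters loses it.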

Tags: regex, url, parsing, text-parsing

Asked by JoshNaro · 1 answer (arielf)
