
How does Facebook's URL matching algorithm work? [duplicate]

You know how if you go to facebook.com and enter a URL into the status update textarea it will automatically be detected, and Facebook will display a little snapshot of data from that URL/link? Facebook doesn't even care if you enter a URL with or without a protocol like http://.

I'm looking to replicate this behavior. Right now I have this regular expression:

((?:https?:\/\/)?)((?:[a-zA-Z0-9\-]+\.)+(?:com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum|travel|[a-z]{2})(?:[a-z0-9\._\/~%\-\+&\#\?!=\(\)@]*)?(?:#?(?:[w]+)?)?)

And I use it to match URLs entered in a textarea. However, it has false positives; it'll match document.write(foo) which clearly isn't a URL.
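The false positive is easy to reproduce. Here is a minimal sketch in Python (an assumption, since no language is named above; `first_match` is a hypothetical helper). The two-letter branch `[a-z]{2}` accepts `wr` as if it were a country-code TLD, and the trailing character class then swallows `ite(foo)`.

```python
import re

# The pattern from the question, copied verbatim (split across lines only).
PATTERN = re.compile(
    r"((?:https?:\/\/)?)((?:[a-zA-Z0-9\-]+\.)+"
    r"(?:com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum|travel|[a-z]{2})"
    r"(?:[a-z0-9\._\/~%\-\+&\#\?!=\(\)@]*)?(?:#?(?:[w]+)?)?)"
)

def first_match(text):
    """Return the first substring the pattern accepts, or None."""
    m = PATTERN.search(text)
    return m.group(0) if m else None

print(first_match("yahoo.com "))           # yahoo.com
print(first_match("document.write(foo)"))  # document.write(foo)
```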

Facebook doesn't seem to have this issue. In fact, I can type "yahoo.com " into Facebook's textarea and it'll recognize it as a URL. But if I type "example.com " it won't recognize it. So this means Facebook must be doing something more than just regular expression matching. Or am I wrong about this?

In conclusion, I want to know what Facebook is doing, and I want to know how I can replicate it. Any ideas, tips or solutions are very much appreciated.

Thanks for reading.

Sam asked Aug 17 '13 05:08


1 Answer

The simplest regex that will match any URL is

[a-z_\.\-0-9]+\.[a-z]+

If this matches, do a lookup on the matched text (for example, try to resolve or fetch it). If the lookup fails, then it wasn't a URL.

There is no safe way to tell whether a string is a URL if it's presented to you without the http:// prefix.

The regex will match stackoverflow.com in the following string:

I always use stackoverflow.com to find the answers i need.
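To see the pattern pick the candidate out of that sentence, here is a small sketch in Python (the language and the name `ANSWER_PATTERN` are assumptions for illustration only):

```python
import re

# The permissive pattern from this answer, copied verbatim.
ANSWER_PATTERN = re.compile(r"[a-z_\.\-0-9]+\.[a-z]+")

sentence = "I always use stackoverflow.com to find the answers i need."
candidates = ANSWER_PATTERN.findall(sentence)
print(candidates)  # ['stackoverflow.com']
```

The same pattern also pulls `document.write` out of `document.write(foo)`, which is why a match can only be treated as a candidate until a lookup confirms it.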

If you prepend "http://www." to the matched value ("http://www." & regex.match.value), you should get a valid URL... or not. You won't know until you do a lookup.
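The match-then-lookup idea could be sketched roughly as follows. This is only one reading of "lookup": it assumes a DNS resolution via Python's `socket.gethostbyname` is enough, whereas a real link preview would probably also fetch the page. `resolves` and `find_urls` are hypothetical helper names, and the sketch prepends plain "http://" rather than "http://www.".

```python
import re
import socket

# The permissive candidate pattern from this answer.
CANDIDATE = re.compile(r"[a-z_\.\-0-9]+\.[a-z]+")

def resolves(host, lookup=socket.gethostbyname):
    """Return True if `lookup` can resolve `host` (assumed meaning of "lookup")."""
    try:
        lookup(host)
        return True
    except (OSError, UnicodeError):
        # socket.gaierror is an OSError; malformed names can raise UnicodeError.
        return False

def find_urls(text, lookup=socket.gethostbyname):
    """Return an http:// URL for every candidate whose host name resolves."""
    urls = []
    for match in CANDIDATE.finditer(text):
        host = match.group(0)
        if resolves(host, lookup):
            urls.append("http://" + host)
    return urls
```

Called as `find_urls("I always use stackoverflow.com to find the answers i need.")`, this performs a real DNS query, so the result depends on the network. The `lookup` parameter is there so the resolver can be swapped out, for instance with a stub in tests or with a cached or rate-limited resolver in production.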

Sedecimdies answered Sep 28 '22 00:09