
Extracting URLs from a String that do not contain 'http'

I have the following 3 strings...

a = "The URL is www.google.com"
b = "The URL is google.com"
c = "The URL is http://www.google.com"

Ruby's URI.extract method only returns the URL in the third string, because that is the only one that includes the http:// scheme.

URI.extract(a)
=> []

URI.extract(b)
=> []

URI.extract(c)
=> ["http://www.google.com"]

How can I create a method to detect and return the URL in all 3 instances?

tob88 asked Jul 05 '13 13:07


3 Answers

Use a regular expression.

Here is a basic one that should work for most cases:

/(https?:\/\/)?\w*\.\w+(\.\w+)*(\/\w+)*(\.\w*)?/.match( a ).to_s

This only fetches the first URL in the string and returns it as a string (an empty string if nothing matches).
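If a string may contain more than one URL, one possible variation is to use String#scan instead of match. The following is only a sketch: the constant URL_PATTERN and the method extract_urls are hypothetical names introduced here, and the groups are rewritten as non-capturing so that scan returns whole matches rather than group contents. The pattern is as loose as the original, so it will also pick up things like "3.14" or "e.g".

```ruby
# Same expression as above, with non-capturing groups (?:...) so that
# String#scan yields full matches instead of arrays of captured groups.
URL_PATTERN = /(?:https?:\/\/)?\w*\.\w+(?:\.\w+)*(?:\/\w+)*(?:\.\w*)?/

# Hypothetical helper: returns every URL-like substring, or [] if none.
def extract_urls(text)
  text.scan(URL_PATTERN)
end

a = "The URL is www.google.com"
b = "The URL is google.com"
c = "The URL is http://www.google.com"

extract_urls(a) # => ["www.google.com"]
extract_urls(b) # => ["google.com"]
extract_urls(c) # => ["http://www.google.com"]
```

Returning an array keeps the "no match" case ([]) distinct from a real result, at the cost of the caller having to pick an element.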

Sucrenoir answered Oct 15 '22 17:10


There's no perfect solution to this problem: it's fraught with edge cases. However, you might be able to get tolerably good results using something like the regular expressions used by Twitter to extract URLs from tweets (stripping off the extra leading spaces is left as an exercise!):

require './regex.rb'

def extract_url(s)
  s[Twitter::Regex[:valid_url]]
end

a = "The URL is www.google.com"
b = "The URL is google.com"
c = "The URL is http://www.google.com"

extract_url(a)
# => " www.google.com"
extract_url(b)
# => " google.com"
extract_url(c)
# => " http://www.google.com"
threedaymonk answered Oct 15 '22 16:10


You seem to be satisfied with Sucrenoir's answer. The essence of Sucrenoir's answer is to identify a URL by assuming that it includes at least one period. If that is the case, Sucrenoir's regex can be simplified (not equivalently, but for the most part) to this:

string[/\S+\.\S+/]
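As a rough illustration, here is that expression applied to the three strings from the question. The method name first_dotted_token is hypothetical and only wraps the one-liner; because the pattern accepts any whitespace-free run containing a period, it would equally match a token such as "file.txt".

```ruby
# Hypothetical wrapper around the simplified pattern: returns the first
# whitespace-delimited token that contains a period, or nil if there is none.
def first_dotted_token(string)
  string[/\S+\.\S+/]
end

a = "The URL is www.google.com"
b = "The URL is google.com"
c = "The URL is http://www.google.com"

first_dotted_token(a)               # => "www.google.com"
first_dotted_token(b)               # => "google.com"
first_dotted_token(c)               # => "http://www.google.com"
first_dotted_token("No links here") # => nil
```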
sawa answered Oct 15 '22 15:10