Extracting URLs from a String that do not contain 'http'

Question

I have the following three strings:

a = "The URL is www.google.com"
b = "The URL is google.com"
c = "The URL is http://www.google.com"

Ruby's URI.extract method only returns the URL in the third string, because that is the only one that includes the http:// scheme.

URI.extract(a)
=> []

URI.extract(b)
=> []

URI.extract(c)
=> ["http://www.google.com"]

How can I create a method to detect and return the URL in all 3 instances?

Sucrenoir · Accepted Answer

Use regular expressions.

Here is a basic one that should work for most cases:

/(https?:\/\/)?\w*\.\w+(\.\w+)*(\/\w+)*(\.\w*)?/.match( a ).to_s

This fetches only the first URL in the string and returns it as a string (an empty string if nothing matches).
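If every URL in the string is wanted rather than just the first, one possible approach is String#scan. The sketch below is an assumption layered on the regex above, not part of the original answer: the groups are rewritten as non-capturing so that scan returns whole matches, and the names URL_PATTERN and extract_urls are hypothetical.

```ruby
# Same pattern as above, but with non-capturing groups (?:...) so that
# String#scan returns whole matches instead of arrays of group captures.
URL_PATTERN = /(?:https?:\/\/)?\w*\.\w+(?:\.\w+)*(?:\/\w+)*(?:\.\w*)?/

# extract_urls is a hypothetical helper name, not part of Ruby.
# It returns an array with every substring that matches URL_PATTERN.
def extract_urls(text)
  text.scan(URL_PATTERN)
end

a = "The URL is www.google.com"
b = "The URL is google.com"
c = "The URL is http://www.google.com"
d = "See google.com and http://example.org/docs"

extract_urls(a) # => ["www.google.com"]
extract_urls(b) # => ["google.com"]
extract_urls(c) # => ["http://www.google.com"]
extract_urls(d) # => ["google.com", "http://example.org/docs"]
```

Like the original pattern, this treats anything shaped like word.word as a URL, so expect false positives on text such as file names or abbreviations.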

threedaymonk · Answer

There's no perfect solution to this problem: it's fraught with edge cases. However, you might be able to get tolerably good results using something like the regular expressions used by Twitter to extract URLs from tweets (stripping off the extra leading spaces is left as an exercise!):

require './regex.rb'

def extract_url(s)
  s[Twitter::Regex[:valid_url]]
end

a = "The URL is www.google.com"
b = "The URL is google.com"
c = "The URL is http://www.google.com"

extract_url(a)
# => " www.google.com"
extract_url(b)
# => " google.com"
extract_url(c)
# => " http://www.google.com"
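For the exercise of stripping the leading space, a minimal sketch could look like the following. The helper name clean_match is hypothetical, and the sketch works on the raw result of the lookup (a String, or nil when nothing matched) so that it does not depend on the Twitter regex file. It assumes Ruby 2.3 or later for the &. operator.

```ruby
# clean_match is a hypothetical helper: it takes the raw result of a
# regex lookup such as s[Twitter::Regex[:valid_url]] and trims surrounding
# whitespace. The safe-navigation operator keeps nil as nil.
def clean_match(raw)
  raw&.strip
end

clean_match(" www.google.com")        # => "www.google.com"
clean_match(" http://www.google.com") # => "http://www.google.com"
clean_match(nil)                      # => nil
```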

sawa · Answer

You seem to be satisfied with Sucrenoir's answer. The essence of Sucrenoir's answer is to identify a URL by assuming that it includes at least one period. If that is the case, Sucrenoir's regex can be simplified (not equivalently, but for the most part) to this:

string[/\S+\.\S+/]
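Applied to the three strings from the question, the simplified pattern behaves roughly as sketched here. The wrapper name find_url is hypothetical and exists only to make the example self-contained.

```ruby
# find_url is a hypothetical helper name used only for this illustration.
# It returns the first run of non-whitespace characters that contains a
# period with at least one character on each side, or nil if none exists.
def find_url(string)
  string[/\S+\.\S+/]
end

find_url("The URL is www.google.com")        # => "www.google.com"
find_url("The URL is google.com")            # => "google.com"
find_url("The URL is http://www.google.com") # => "http://www.google.com"
find_url("No URL here")                      # => nil
find_url("Visit google.com.")                # => "google.com."
```

The last call shows one cost of the simplification: because \S matches any non-whitespace character, trailing punctuation is included in the result.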

Tags: ruby, ruby-on-rails, ruby-on-rails-3

tob88
