How can I extract a URL with non-English characters from a string?

Question

Here's a simple script that takes an anchor tag with a German URL in it, and extracts the URL:

# encoding: utf-8

require 'uri'

url = URI.extract('<a href="http://www.example.com/wp-content/uploads/2012/01/München.jpg">München</a>')

puts url

Output:

http://www.example.com/wp-content/uploads/2012/01/M

The extract method stops at the ü. How can I get it to work with non-English letters? I'm using ruby-1.9.3-p0.

the Tin Man · Accepted Answer

Ruby's built-in URI is useful for some things, but it's not the best choice when dealing with international characters or IDNA addresses. For that I recommend using the Addressable gem.

This is some cleaned-up IRB output:

require 'addressable/uri'
url = 'http://www.example.com/wp content/uploads/2012/01/München.jpg'
uri = Addressable::URI.parse(url)

Here's what Ruby knows now:

#<Addressable::URI:0x102c1ca20
    @uri_string = nil,
    @validation_deferred = false,
    attr_accessor :authority = nil,
    attr_accessor :host = "www.example.com",
    attr_accessor :path = "/wp content/uploads/2012/01/München.jpg",
    attr_accessor :scheme = "http",
    attr_reader :hash = nil,
    attr_reader :normalized_host = nil,
    attr_reader :normalized_path = nil,
    attr_reader :normalized_scheme = nil
>

Looking at the path, you can see it as it was entered, or percent-encoded as it should be:

1.9.2-p290 :004 > uri.path            # => "/wp content/uploads/2012/01/München.jpg"
1.9.2-p290 :005 > uri.normalized_path # => "/wp%20content/uploads/2012/01/M%C3%BCnchen.jpg"
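
To make clear what that normalized form contains, here is a minimal standard-library sketch that percent-encodes each path segment. It is only an approximation for illustration, not how Addressable implements normalization; `percent_encode_path` is a hypothetical helper name, and the sketch assumes the input has not already been percent-encoded (a literal `%` would be encoded again).

```ruby
require 'erb'

# Hypothetical helper (not part of Addressable): percent-encode each path
# segment with ERB::Util.url_encode, leaving the "/" separators alone.
# The -1 limit keeps trailing empty segments, so a trailing "/" survives.
def percent_encode_path(path)
  path.split('/', -1).map { |segment| ERB::Util.url_encode(segment) }.join('/')
end

path = '/wp content/uploads/2012/01/München.jpg'
encoded = percent_encode_path(path)
puts encoded # => /wp%20content/uploads/2012/01/M%C3%BCnchen.jpg
```

The space becomes `%20` and the two UTF-8 bytes of `ü` become `%C3%BC`, matching the `normalized_path` shown above.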

Addressable really should be selected to replace Ruby's URI considering how the internet is moving to more complex URIs and mixed Unicode characters.

Now, getting at the string is easy too, but depends on how much text you have to look through.

If you have a full HTML document, your best bet is to use Nokogiri to parse the HTML and extract the href attributes from the <a> tags. This is where to start for a single <a>:

require 'nokogiri'
html = '<a href="http://www.example.com/wp content/uploads/2012/01/München.jpg">München</a>'
doc = Nokogiri::HTML::DocumentFragment.parse(html)

doc.at('a')['href'] # => "http://www.example.com/wp content/uploads/2012/01/München.jpg"

Parsing using DocumentFragment avoids wrapping the fragment in the usual <html><body> tags. For a full document you'd want to use:

doc = Nokogiri::HTML.parse(html)

Here's the difference between the two:

irb(main):006:0> Nokogiri::HTML::DocumentFragment.parse(html).to_html
=> "<a href=\"http://www.example.com/wp%20content/uploads/2012/01/M%C3%BCnchen.jpg\">München</a>"

versus:

irb(main):007:0> Nokogiri::HTML.parse(html).to_html
=> "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">
<html><body><a href=\"http://www.example.com/wp%20content/uploads/2012/01/M%C3%BCnchen.jpg\">München</a></body></html>
"

So, use the second for a full HTML document, and for a small, partial chunk, use the first.

To scan an entire document, extracting all the hrefs, use:

hrefs = doc.search('a').map{ |a| a['href'] }

If you only have small strings like you show in your example, you can consider using a simple regex to isolate the needed href:

html[/href="([^"]+)"/, 1]
=> "http://www.example.com/wp content/uploads/2012/01/München.jpg"
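
If a short string holds more than one link, the same pattern can be used with `String#scan`. This is a sketch that assumes every href is double-quoted, as in the example above; the second URL is made up for illustration. For anything less regular, a real HTML parser is the safer choice.

```ruby
# Two anchors in one small snippet; the second URL is a made-up example.
html = '<a href="http://www.example.com/wp content/uploads/2012/01/München.jpg">München</a> ' \
       '<a href="http://www.example.com/Köln.jpg">Köln</a>'

# scan returns one array per match (one element per capture group),
# so flatten turns the result into a plain list of URLs.
hrefs = html.scan(/href="([^"]+)"/).flatten

puts hrefs
```

Because the regex only looks for the closing quote, spaces and non-ASCII letters inside the URL come through untouched.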

fuzzyalej · Answer

You have to encode the URL first:

URI.extract(URI.encode('<a href="http://www.example.com/wp_content/uploads/2012/01/München.jpg">München</a>'))
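
Note that `URI.encode` (an alias of `URI.escape`) was removed in Ruby 3.0, so the line above only runs on older Rubies. A rough standard-library alternative, offered as a sketch rather than a drop-in replacement, is to percent-encode just the non-ASCII characters, pull the href out with a regex, and hand the result to `URI.parse`. It assumes a double-quoted href with no spaces or other characters that `URI.parse` rejects.

```ruby
require 'uri'

html = '<a href="http://www.example.com/wp_content/uploads/2012/01/München.jpg">München</a>'

# Percent-encode only the non-ASCII characters, one UTF-8 byte at a time.
# ASCII characters, including the markup itself, are left as they are.
ascii_html = html.gsub(/[^\x00-\x7F]/) do |char|
  char.bytes.map { |byte| format('%%%02X', byte) }.join
end

# Assumes the href is double-quoted.
href = ascii_html[/href="([^"]+)"/, 1]
uri  = URI.parse(href)

puts uri.path # => /wp_content/uploads/2012/01/M%C3%BCnchen.jpg
```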
