
Alternative to URI.parse that allows hostnames to contain an underscore

Tags: uri, ruby

I'm using DMOZ's list of URL topics, which contains some URLs whose hostnames contain an underscore.

For example:

<ExternalPage about="http://outer_heaven4.tripod.com/index2.htm">
  <d:Title>The Outer Heaven</d:Title>
  <d:Description>Information and image gallery of McFarlane's action figures for Trigun, Akira, Tenchi Muyo and other Japanese Sci-Fi animations.</d:Description>
  <topic>Top/Arts/Animation/Anime/Collectibles/Models_and_Figures/Action_Figures</topic>
</ExternalPage>

While this URL will work in a web browser (or, at least, it does in mine :p), it's not legal according to the standard:

a hostname may not contain other characters, such as the underscore character (_),

which causes errors when trying to parse such a URL with URI.parse:

[2] pry(main)> require 'uri'
=> true
[3] pry(main)> URI.parse "http://outer_heaven4.tripod.com/index2.htm"
URI::InvalidURIError: the scheme http does not accept registry part: outer_heaven4.tripod.com (or bad hostname?)
from ~/.rvm/rubies/ruby-1.9.3-p194/lib/ruby/1.9.1/uri/generic.rb:213:in `initialize'

Is there an alternative to URI.parse I can use that has lower strictness without just rolling my own?

asked Nov 02 '12 14:11 by rampion


1 Answer

Try Addressable::URI. It follows the RFCs more closely than URI and is very flexible.

require 'addressable/uri'
uri = Addressable::URI.parse('http://outer_heaven4.tripod.com/index2.htm')
uri.host
# => "outer_heaven4.tripod.com"

I've used it for some projects and have been happy with it. URI is getting a bit... rusty and is in need of TLC. Others have commented on it too:

http://www.cloudspace.com/blog/2009/05/26/replacing-rubys-uri-with-addressable/

There was quite a discussion about URI's state several years ago among the Ruby developers. I can't find the link to it right now, but there was a recommendation that Addressable::URI be used as a replacement. I don't know if someone stepped up to take over URI development, or where things stand right now. In my own code I continue to use URI for simple things and switch to Addressable::URI when URI proves to do the wrong thing for me.
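
The "use URI first, switch to Addressable::URI when URI fails" approach could be sketched roughly as follows. This is only an illustration: `parse_lenient` is a hypothetical helper name, not part of either library, and it assumes the addressable gem is installed for the fallback path.

```ruby
require 'uri'

# Hypothetical helper (an assumption for illustration, not a library API):
# try the standard library first and fall back to Addressable only when
# URI rejects the input.
def parse_lenient(url)
  URI.parse(url)
rescue URI::InvalidURIError
  # Loaded lazily, so the addressable gem is only needed on the fallback path.
  require 'addressable/uri'
  Addressable::URI.parse(url)
end

# Usage:
#   parse_lenient('http://outer_heaven4.tripod.com/index2.htm').host
```

Note that the return value is a URI object in one case and an Addressable::URI in the other. Both respond to host, scheme and path, but code that relies on anything beyond the common accessors should probably pick one library and stick with it. Addressable can still raise its own Addressable::URI::InvalidURIError for input it cannot parse.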

answered Oct 02 '22 16:10 by the Tin Man