I am trying to get what's inside the title tag, but I can't get it to work. I am following some of the answers around Stack Overflow that are supposed to work, but for me they don't.
This is what I am doing:
require "open-uri"
require "uri"
def browse startpage, depth, block
  if depth > 0
    begin
      open(startpage){ |f|
        block.call startpage, f
      }
    rescue
      return
    end
  end
end
browse("https://www.ruby-lang.org/es/", 2, lambda { |page_name, web|
  puts "Header information:"
  puts "Title: #{web.to_s.scan(/<title>(.*?)<\/title>/)}"
  puts "Base URI: #{web.base_uri}"
  puts "Content Type: #{web.content_type}"
  puts "Charset: #{web.charset}"
  puts "-----------------------------"
})
The title output is just [], why?
open returns a File object or passes it to the block (actually a Tempfile, but that doesn't matter). Calling to_s just returns a string containing the object's class and its id:
open('https://www.ruby-lang.org/es/') do |f|
  f.to_s
end
#=> "#<File:0x007ff8e23bfb68>"
Scanning that string for a title is obviously useless:
"#<File:0x007ff8e23bfb68>".scan(/<title>(.*?)<\/title>/)
#=> []
Instead, you have to read the file's content:
open('https://www.ruby-lang.org/es/') do |f|
  f.read
end
#=> "<!DOCTYPE html>\n<html>\n...</html>\n"
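The difference between to_s and read can be tried without touching the network. This is only a minimal sketch: a StringIO stands in for the object that open yields, and the HTML string is made up for illustration.

```ruby
require "stringio"

# Stand-in for the IO-like object that open-uri yields to the block.
# The HTML below is invented for this example, not fetched from anywhere.
io = StringIO.new("<html><head><title>Example page</title></head></html>")

# to_s describes the object itself (class name plus an id), not its content.
object_label = io.to_s

# read returns the actual document text, which is what the regex needs.
content = io.read

puts object_label
puts content
```

As a side note, on Ruby 3.0 and later a bare open no longer fetches URLs; with open-uri loaded, URI.open is the call to use instead.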
You can now scan the content for a <title> tag:
open('https://www.ruby-lang.org/es/') do |f|
  str = f.read
  str.scan(/<title>(.*?)<\/title>/)
end
#=> [["Lenguaje de Programaci\xC3\xB3n Ruby"]]
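Note that scan returns an array of arrays, and the \xC3\xB3 in that output suggests the string is not tagged as UTF-8. If you only want the first title as a plain string, something like the following may help. It is a sketch under assumptions: the html variable is a hand-written stand-in for f.read, and treating the bytes as UTF-8 is a guess that matches this page but not necessarily others.

```ruby
# Hypothetical page content standing in for f.read, kept as raw bytes (.b)
# to mimic the escaped output shown above.
html = "<html><head>\n<title>Lenguaje de Programaci\xC3\xB3n Ruby</title>\n</head></html>".b

# String#[] with a regex and a capture index returns the first capture
# as a String, or nil when there is no match. The /m flag lets . match newlines.
title = html[/<title>(.*?)<\/title>/m, 1]

# Relabel the bytes as UTF-8 (an assumption about the page's charset)
# so the accented character prints properly.
title = title.force_encoding("UTF-8") if title

puts title
```

In the real callback, web.charset could be passed to force_encoding instead of hard-coding "UTF-8".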
Or, using Nokogiri (because you can't parse [X]HTML with regex):
require 'nokogiri'
open('https://www.ruby-lang.org/es/') do |f|
  doc = Nokogiri::HTML(f)
  doc.at_css('title').text
end
#=> "Lenguaje de Programación Ruby"
If you insist on using open-uri, this one-liner can get you the page title:
2.1.4 :008 > puts open('https://www.ruby-lang.org/es/').read.scan(/<title>(.*?)<\/title>/)
Lenguaje de Programación Ruby
=> nil
If you want to do something more complicated than this, please use nokogiri or mechanize. Thanks