
Mechanize - How to follow or "click" Meta refreshes in rails

I'm having a bit of trouble with Mechanize.

When I submit a form with Mechanize, I land on a page that contains only a meta refresh and no links.

My question is: how do I follow the meta refresh?

I have tried enabling meta refresh following, but then I get a socket error. Sample code:

require 'mechanize'
agent = WWW::Mechanize.new
agent.get("http://euroads.dk")
form = agent.page.forms.first
form.username = "username"
form.password = "password"
form.submit
page = agent.get("http://www.euroads.dk/system/index.php?showpage=login")
agent.page.body

The response:

<html>
 <head>
   <META HTTP-EQUIV=\"Refresh\" CONTENT=\"0;URL=index.php?showpage=m_frontpage\">
 </head>
</html>

Then I try:

redirect_url = page.parser.at('META[HTTP-EQUIV=\"Refresh\"]')[
  "0;URL=index.php?showpage=m_frontpage\"][/url=(.+)/, 1]

But I get:

NoMethodError: Undefined method '[]' for nil:NilClass
Rails beginner asked Feb 15 '11 12:02


2 Answers

Internally, Mechanize uses Nokogiri to handle parsing of the HTML into a DOM. You can get at the Nokogiri document so you can use either XPath or CSS accessors to dig around in a returned page.

This is how to get the redirect URL with Nokogiri only:

require 'nokogiri'

html = <<EOT
<html>
  <head>
    <meta http-equiv="refresh" content="2;url=http://www.example.com/">
    </meta>
  </head>
  <body>
    foo
  </body>
</html>
EOT

doc = Nokogiri::HTML(html)
redirect_url = doc.at('meta[http-equiv="refresh"]')['content'][/url=(.+)/, 1]
redirect_url # => "http://www.example.com/"

doc.at('meta[http-equiv="refresh"]')['content'][/url=(.+)/, 1] breaks down to: Find the first occurrence (at) of the CSS accessor for the <meta> tag with an http-equiv attribute of refresh. Take the content attribute of that tag and return the string following url=.
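One caveat with that pattern: /url=(.+)/ is case-sensitive, while the page in the question writes URL= in upper case. Below is a minimal sketch using only core Ruby strings (no Nokogiri involved), with the content values copied from the two examples above; the variable names are purely illustrative.

```ruby
# content attribute values taken from the two examples above
lower = "2;url=http://www.example.com/"
upper = "0;URL=index.php?showpage=m_frontpage"

# String#[] with a regex and a capture index returns that capture,
# or nil when the pattern does not match.
lower_url       = lower[/url=(.+)/, 1]   # matches the lower-case "url="
upper_sensitive = upper[/url=(.+)/, 1]   # case-sensitive: finds no "url="
upper_url       = upper[/url=(.+)/i, 1]  # the i flag ignores case

puts lower_url.inspect
puts upper_sensitive.inspect
puts upper_url.inspect
```

If the markup you are scraping may use either case, adding the i flag is probably the safer choice.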

This is typical Mechanize code. Because you gave no sample code to base mine on, you'll have to work from this:

require 'mechanize'

agent = Mechanize.new
page = agent.get('http://www.examples.com/')
# page.parser is the underlying Nokogiri document
redirect_url = page.parser.at('meta[http-equiv="refresh"]')['content'][/url=(.+)/, 1]
# fetch the page the meta refresh points to
page = agent.get(redirect_url)

EDIT: at('META[HTTP-EQUIV=\"Refresh\"]')

Your code has the above at(). Notice that you are escaping the double quotes inside a single-quoted string. That leaves a backslash followed by a double quote in the string, which is NOT what my sample uses, and is my first guess for why you're getting that error. Nokogiri can't find the tag because there is no <meta http-equiv=\"Refresh\"...>.
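The quoting difference is easy to check in plain Ruby. This small sketch compares the selector string from the question with two equivalent ways of writing the intended one; the variable names are made up for illustration.

```ruby
single = 'META[HTTP-EQUIV=\"Refresh\"]'  # single quotes: backslashes are kept literally
double = "META[HTTP-EQUIV=\"Refresh\"]"  # double quotes: \" becomes a plain "
plain  = 'META[HTTP-EQUIV="Refresh"]'    # single quotes need no escaping for "

has_backslash = single.include?("\\")          # the stray backslashes are really there
same_selector = (double == plain)              # these two are the same string
extra_chars   = single.length - plain.length   # one extra character per backslash

puts single
puts plain
```

Printing both strings side by side makes the extra backslashes visible, which is the selector Nokogiri actually receives.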

EDIT: Mechanize has a built-in way to handle meta-refresh, by setting:

 agent.follow_meta_refresh = true

It also has a method to parse the meta tag and return the content. From the docs:

parse(content, uri)

Parses the delay and url from the content attribute of a meta tag. Parse requires the uri of the current page to infer a url when no url is specified. If a block is given, the parsed delay and url will be passed to it for further processing. Returns nil if the delay and url cannot be parsed.

# <meta http-equiv="refresh" content="5;url=http://example.com/" />
uri = URI.parse('http://current.com/')

Meta.parse("5;url=http://example.com/", uri)  # => ['5', 'http://example.com/']
Meta.parse("5;url=", uri)                     # => ['5', 'http://current.com/']
Meta.parse("5", uri)                          # => ['5', 'http://current.com/']
Meta.parse("invalid content", uri)            # => nil
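If you would rather not depend on that method, the same idea can be sketched with the standard library alone. This is not Mechanize's implementation: parse_refresh is a hypothetical helper, its regex is my own assumption about what a content value looks like, and it returns the URL as a string resolved against the current page with URI.join.

```ruby
require 'uri'

# Hypothetical helper (not part of Mechanize): splits a meta refresh content
# value into [delay, absolute_url], or returns nil if it cannot be parsed.
def parse_refresh(content, base_uri)
  pattern = /\A\s*(\d+(?:\.\d+)?)\s*(?:;\s*url\s*=\s*['"]?([^'"]*)['"]?)?\s*\z/i
  match = content.match(pattern)
  return nil unless match

  delay = match[1]
  url   = match[2].to_s.strip
  # Fall back to the current page when no URL is given;
  # otherwise resolve a possibly relative URL against it.
  target = url.empty? ? base_uri : URI.join(base_uri.to_s, url)
  [delay, target.to_s]
end

# The page and content value from the question
base = URI.parse('http://www.euroads.dk/system/index.php?showpage=login')
delay, url = parse_refresh('0;URL=index.php?showpage=m_frontpage', base)
puts url
```

The relative index.php?showpage=m_frontpage from the question resolves against the /system/ directory of the login page, so the result can be passed to agent.get as an absolute URL.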
the Tin Man answered Nov 15 '22 04:11


Mechanize treats meta refresh elements just like links without text. Thus, your code can be as simple as this:

page = agent.get("http://www.euroads.dk/system/index.php?showpage=login")
page.meta_refresh.first.click
Earl Jenkins answered Nov 15 '22 03:11