Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

getaddrinfo error with Mechanize

I wrote a script that will go through all of the customers in our database, verify that their website URL works, and try to find a twitter link on their homepage. We have a little over 10,000 URLs to verify. After a fraction of if the urls are verified, we start getting getaddrinfo errors for every URL.

Here's a copy of the code that scrapes a single URL:

def scrape_url(url) 
  url_found = false 
  twitter_name = nil 

  begin 
    agent = Mechanize.new do |a| 
      a.follow_meta_refresh = true 
    end 

    agent.get(normalize_url(url)) do |page| 
      url_found = true 
      twitter_name = find_twitter_name(page) 
    end 

    @err << "[#{@current_record}] SUCCESS\n" 
  rescue Exception => e 
    @err << "[#{@current_record}] ERROR (#{url}): " 
    @err << e.message 
    @err << "\n" 
  end 

  [url_found, twitter_name] 
end

Note: I've also run a version of this code that creates a single Mechanize instance that gets shared across all calls to scrape_url. It failed in exactly the same fashion.

When I run this on EC2, it gets through almost exactly 1,000 urls, then returns this error for the remaining 9,000+:

getaddrinfo: Temporary failure in name resolution

Note, I've tried using both Amazon's DNS servers and Google's DNS servers, thinking it might be a legitimate DNS issue. I got exactly the same result in both cases.

Then, I tried running it on my local MacBook Pro. It only got through about 250 before returning this error for the remainder of the records:

getaddrinfo: nodename nor servname provided, or not known

Does anyone know how I can get the script to make it through all of the records?

like image 742
EricM Avatar asked Nov 01 '12 22:11

EricM


1 Answers

I found the solution. Mechanize was leaving the connection open and relying on GC to clean them up. After a certain point, there were enough open connections that no additional outbound connection could be established to do a DNS lookup. Here's the code that caused it to work:

agent = Mechanize.new do |a| 
  a.follow_meta_refresh = true
  a.keep_alive = false
end

By setting keep_alive to false, the connection is immediately closed and cleaned up.

like image 188
EricM Avatar answered Sep 21 '22 03:09

EricM