Python 2.6 urllib2 timeout issue

It seems I cannot get the urllib2 timeout to be taken into account. I have read, I believe, all the posts related to this topic, and it seems I'm not doing anything wrong. Am I correct? Many thanks for your kind help.

Scenario:

I need to check for Internet connectivity before continuing with the rest of a script, so I wrote a function (Net_Access), which is provided below.

  • When I execute this code with my LAN or WiFi interface connected and check an existing hostname, all is fine: there is no error or problem, thus no timeout.
  • If I unplug my LAN connector or check against a non-existent hostname, the timeout value seems to be ignored. What's wrong with my code, please?

Some info:

  • Ubuntu 10.04.4 LTS (running in a VirtualBox v4.2.6 VM; host OS is Mac OS X Lion)
  • cat /proc/sys/kernel/osrelease: 2.6.32-42-generic
  • Python 2.6.5

My code:

#!/usr/bin/env python

import socket
import urllib2

myhost = 'http://www.google.com'
timeout = 3

socket.setdefaulttimeout(timeout)   # default timeout for all new sockets
req = urllib2.Request(myhost)

try:
    handle = urllib2.urlopen(req, timeout=timeout)   # per-request timeout as well
except urllib2.URLError as e:
    socket.setdefaulttimeout(None)
    print('[--- Net_Access() --- No network access')
else:
    print('[--- Net_Access() --- Internet Access OK')

1) Working, with LAN connector plugged in

$ time ./Net_Access 
[--- Net_Access() --- Internet Access OK

real    0m0.223s
user    0m0.060s
sys 0m0.032s

2) Timeout not working, with LAN connector unplugged

$ time ./Net_Access 
[--- Net_Access() --- No network access

real    1m20.235s
user    0m0.048s
sys 0m0.060s

Added to original post: test results (using IP instead of FQDN)

As suggested by @unutbu (see comments), replacing the FQDN in myhost with an IP address fixes the problem: the timeout takes effect.

LAN connector plugged in...
$ time ./Net_Access 
[--- Net_Access() --- Internet Access OK

real    0m0.289s
user    0m0.036s
sys 0m0.040s

LAN connector unplugged...
$ time ./Net_Access 
[--- Net_Access() --- No network access

real    0m3.082s
user    0m0.052s
sys 0m0.024s

This is nice, but it means the timeout can only be used with an IP address and not an FQDN. Weird...

Has anyone found a way to use the urllib2 timeout without resolving DNS beforehand and passing an IP to the function? Or do you first use socket to test the connection and only fire urllib2 once you are sure you can reach the target?

Many thanks.


user1943566


2 Answers

If your problem is with DNS lookup taking forever (or just way too long) to time out when there's no network connectivity, then yes, this is a known problem, and there's nothing you can do within urllib2 itself to fix that.

So, is all hope lost? Well, not necessarily.

First, let's look at what's going on. Ultimately, urlopen relies on getaddrinfo, which (along with its relatives like gethostbyname) is notoriously the one critical piece of the socket API that can't be run asynchronously or interrupted (and on some platforms, it's not even thread-safe). If you want to trace through the source yourself, urllib2 defers to httplib for creating connections, which calls create_connection on socket, which calls socket_getaddrinfo on _socket, which ultimately calls the real getaddrinfo function. This is an infamous problem that affects every network client or server written in every language in the world, and there's no good, easy solution.
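
For illustration, here's a minimal sketch of the symptom (assuming the network is down; the hostname is just an example): socket.setdefaulttimeout applies to socket I/O, but the resolver behind getaddrinfo ignores it, so the lookup can still block for as long as the system resolver takes.

import socket
import time

socket.setdefaulttimeout(3)   # affects socket I/O, not the DNS lookup

start = time.time()
try:
    # With the network down, this can block for the resolver's own timeout
    # (often tens of seconds), regardless of setdefaulttimeout above.
    socket.getaddrinfo('www.google.com', 80)
except socket.gaierror as e:
    print('lookup failed after %.1fs: %s' % (time.time() - start, e))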

One option is to use a different higher-level library that's already solved this problem. I believe requests relies on urllib3 which ultimately has the same problem, but pycurl relies on libcurl, which, if built with c-ares, does name lookup asynchronously, and therefore can time it out.
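
For example, a rough sketch with pycurl (the URL is just an example; whether the DNS phase can actually be cut short depends on libcurl being built with c-ares):

import pycurl

def reachable(url, timeout=3):
    c = pycurl.Curl()
    c.setopt(pycurl.URL, url)
    c.setopt(pycurl.NOBODY, True)             # HEAD-style request, no body
    c.setopt(pycurl.CONNECTTIMEOUT, timeout)  # covers name lookup + TCP connect
    c.setopt(pycurl.TIMEOUT, timeout)         # overall cap on the transfer
    try:
        c.perform()
        return True
    except pycurl.error:
        return False
    finally:
        c.close()

print(reachable('http://www.google.com'))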

Or, of course, you can use something like twisted or tornado or some other async networking library. But obviously rewriting all of your code to use a twisted HTTP client instead of urllib2 is not exactly trivial.

Another option is to "fix" urllib2 by monkeypatching the standard library. If you want to do this, there are two steps.

First, you have to provide a timeoutable getaddrinfo. You could do this by binding c-ares, or using ctypes to access platform-specific APIs like linux's getaddrinfo_a, or even looking up the nameservers and communicating with them directly. But the really simple way to do it is to use threading. If you're doing lots of these, you'll want to use a single thread or small threadpool, but for small-scale use, just spin off a thread for each call. A really quick-and-dirty (read: bad) implementation is:

import socket
import threading

timeout = 3   # how long to wait for the lookup, in seconds

def getaddrinfo_async(*args):
    result = []
    # a lambda can't contain an assignment, so stash the result in a list
    t = threading.Thread(target=lambda: result.append(socket.getaddrinfo(*args)))
    t.daemon = True   # don't let a stuck lookup keep the process alive
    t.start()
    t.join(timeout)
    if t.isAlive():
        raise socket.timeout('getaddrinfo timed out')
    return result[0]

Next, you have to get all the libraries you care about to use this. Depending on how ubiquitous (and dangerous) you want your patch to be, you can replace socket.getaddrinfo itself, or just socket.create_connection, or just the code in httplib or even urllib2.
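
For instance, a sketch of the bluntest variant, replacing socket.getaddrinfo itself with the getaddrinfo_async above (keep a reference to the original so you can undo the patch):

import socket

_real_getaddrinfo = socket.getaddrinfo   # saved so the patch can be reverted
socket.getaddrinfo = getaddrinfo_async   # urllib2/httplib lookups now go through the thread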

A final option is to fix this at a higher level. If your networking stuff is happening on a background thread, you can throw a higher-level timeout on the whole thing, and if it takes more than timeout seconds to find out whether it has timed out, then you know it has.
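
A minimal sketch of that idea, assuming all you need is a yes/no answer (the worker thread is simply abandoned if it takes too long):

import threading
import urllib2

def can_reach(url, timeout=3):
    result = []
    def worker():
        try:
            urllib2.urlopen(url)
            result.append(True)
        except urllib2.URLError:
            result.append(False)
    t = threading.Thread(target=worker)
    t.daemon = True   # an abandoned thread won't block interpreter exit
    t.start()
    t.join(timeout)
    return bool(result and result[0])

print(can_reach('http://www.google.com'))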


abarnert


Perhaps try this:

import urllib2

def get_header(url):
    req = urllib2.Request(url)
    req.get_method = lambda : 'HEAD'
    try:
        response = urllib2.urlopen(req)
    except urllib2.URLError:
        # urllib2.URLError: <urlopen error [Errno -2] Name or service not known>
        return False
    return True

url = 'http://www.kernel.org/pub/linux/kernel/v3.0/linux-3.7.1.tar.bz2'
print(get_header(url))

When I unplug my network adapter, this prints False almost immediately, while under normal conditions this prints True.

I'm not sure why this works so quickly compared to your original code (even without needing to set the timeout parameter), but perhaps it will work for you too.


I did an experiment this morning in which get_header did not return immediately. I booted the computer with the router off, then turned the router on, then enabled networking and wireless through the Ubuntu GUI. This failed to establish a working connection, and at this stage get_header failed to return immediately.

So, here is a heavier-weight solution which calls get_header in a subprocess using multiprocessing.Pool. The object returned by pool.apply_async has a get method with a timeout parameter. If a result is not returned from get_header within the duration specified by timeout, the subprocess is terminated.

Thus, check_http should return a result within about 1 second, under all circumstances.

import multiprocessing as mp
import urllib2

def timeout_function(cmd, timeout = None, args = (), kwds = {}):
    # Run cmd in a one-process pool; give up after `timeout` seconds.
    pool = mp.Pool(processes = 1)
    result = pool.apply_async(cmd, args = args, kwds = kwds)
    try:
        retval = result.get(timeout = timeout)
    except mp.TimeoutError:
        # Kill the worker stuck in the lookup, then re-raise.
        pool.terminate()
        pool.join()
        raise
    else:
        return retval

def get_header(url):
    req = urllib2.Request(url)
    req.get_method = lambda : 'HEAD'
    try:
        response = urllib2.urlopen(req)
    except urllib2.URLError:
        return False
    return True

def check_http(url):
    try:
        response = timeout_function(
            get_header,
            args = (url, ),
            timeout = 1)
        return response
    except mp.TimeoutError:
        return False

print(check_http('http://www.google.com'))

unutbu