Python3 urllib.request will not close connections immediately

I've got the following code to run a continuous loop to fetch some content from a website:

from http.cookiejar import CookieJar
from urllib import request

cj = CookieJar()
cp = request.HTTPCookieProcessor(cj)
hh = request.HTTPHandler()
opener = request.build_opener(cp, hh)

while True:
    # build url
    req = request.Request(url=url)
    p = opener.open(req)
    c = p.read()
    # process c
    p.close()
    # check for abort condition, or continue

The contents are read correctly, but for some reason the TCP connections won't close. I'm watching the active connection count in a dd-wrt router interface, and it goes up steadily. If the script continues to run, it exhausts the router's 4096-connection limit. When this happens, the script simply enters a waiting state (the router won't allow new connections, but the timeout hasn't been hit yet). After a couple of minutes, those connections are closed and the script can resume.

I was able to observe the state of those hanging connections from the router. They all share the same state: TIME_WAIT.

I'm expecting this script to use no more than one TCP connection at a time. What am I doing wrong?

I'm using Python 3.4.2 on Mac OS X 10.10.

asked Nov 09 '14 08:11 by He Shiming



1 Answer

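Through some research, I discovered the cause of this problem: the design of the TCP protocol. In a nutshell, when you close a connection, it isn't dropped immediately. The side that initiates the close puts the connection into the TIME_WAIT state, which lasts twice the maximum segment lifetime (up to 4 minutes by the TCP specification, often less depending on the OS) before the connection is finally discarded. So, contrary to what I was expecting, a closed connection lingers for a while instead of disappearing right away.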

According to this question, it's also not possible to forcefully drop a connection (without restarting the network stack).

It turns out that in my particular case, as this question states, a better option is to use a persistent connection, a.k.a. HTTP keep-alive. Since I'm querying the same server every time, this will work.
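For illustration, here is a minimal sketch of what a persistent connection could look like using only the standard library. urllib.request sends a "Connection: close" header with every request, so this sketch drops down to http.client instead. The function name fetch_all and its parameters are made up for this example, and it does not handle cookies the way the CookieJar-based opener in the question does, so treat it as a starting point rather than a drop-in replacement.

```python
import http.client


def fetch_all(host, paths, port=80, timeout=10):
    """Fetch each path from one host over a single reused TCP connection.

    Hypothetical helper for illustration; not part of the original script.
    """
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    bodies = []
    try:
        for path in paths:
            # HTTP/1.1 connections are persistent by default, so each
            # request reuses the socket opened by the first one as long
            # as the server keeps it open.
            conn.request("GET", path)
            resp = conn.getresponse()
            # The body must be read completely before the connection
            # can be reused for the next request.
            bodies.append(resp.read())
    finally:
        # One close at the very end means one TIME_WAIT entry in total,
        # instead of one per request.
        conn.close()
    return bodies
```

A call such as fetch_all("example.com", ["/a", "/b"]) would then issue both requests over the same socket, provided the server honours keep-alive. If the server closes the connection anyway, http.client reopens it on the next request, so the loop still works but the connection count goes up again.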

answered Oct 16 '22 21:10 by He Shiming