I've got the following code to run a continuous loop to fetch some content from a website:
from http.cookiejar import CookieJar
from urllib import request

cj = CookieJar()
cp = request.HTTPCookieProcessor(cj)
hh = request.HTTPHandler()
opener = request.build_opener(cp, hh)

while True:
    # build url
    req = request.Request(url=url)
    p = opener.open(req)
    c = p.read()
    # process c
    p.close()
    # check for abort condition, or continue
The contents are read correctly, but for some reason the TCP connections won't close. I'm watching the active connection count in a dd-wrt router interface, and it goes up steadily. If the script continues to run, it exhausts the router's 4096-connection limit. When this happens, the script simply enters a waiting state (the router won't allow new connections, but the timeout hasn't been hit yet). After a couple of minutes, those connections are closed and the script can resume.
I was able to observe the state of those hanging connections from the router. They all share the same state: TIME_WAIT.
I'm expecting this script to use no more than one TCP connection at a time. What am I doing wrong?
I'm using Python 3.4.2 on Mac OS X 10.10.
Through some research, I discovered the cause of this problem: the design of the TCP protocol. In a nutshell, when you disconnect, the connection isn't dropped immediately; the side that closes first enters the TIME_WAIT state, which times out after up to 4 minutes (twice the maximum segment lifetime; the exact duration depends on the operating system). Contrary to what I was expecting, the connection doesn't disappear right away.
According to this question, it's also not possible to forcefully drop a connection (without restarting the network stack).
It turns out that in my particular case, as this question states, a better option is to use a persistent connection, a.k.a. HTTP keep-alive. Since I'm querying the same server every time, this works.
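For anyone landing here later, this is roughly what the keep-alive approach can look like. It is only a minimal sketch, not necessarily what the linked question recommends: urllib.request sends "Connection: close" with every request, so it opens a fresh socket each time, whereas http.client.HTTPConnection speaks HTTP/1.1 and can reuse one socket if the server allows it. The function name fetch_all and its parameters are made up for illustration, and cookie handling from the original script is left out.

```python
import http.client


def fetch_all(host, paths, port=80, timeout=10):
    """Fetch every path from one host over a single persistent connection.

    Hypothetical helper for illustration; it assumes plain HTTP and a
    server that honours HTTP/1.1 keep-alive.
    """
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    bodies = []
    try:
        for path in paths:
            conn.request("GET", path)
            resp = conn.getresponse()
            # The body has to be read in full before the same
            # connection can carry the next request.
            bodies.append(resp.read())
    finally:
        # One close at the very end, so at most one socket is left
        # in TIME_WAIT instead of one per request.
        conn.close()
    return bodies
```

Something like fetch_all("example.com", ["/page1", "/page2"]) would then issue both requests on the same socket (host and paths are placeholders). If the server answers with "Connection: close" anyway, HTTPConnection reconnects on the next request, so the loop still works but the benefit is lost. If you also need cookies, a third-party library such as requests (with a Session object) handles both cookies and connection reuse.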