How can I debug what is causing a connection refused or a connection time out?

Tags: python, networking

I have the following code that has worked for about a year:

    import urllib2

    req = urllib2.Request('https://somewhere.com', '<Request></Request>')
    data = urllib2.urlopen(req)
    print data.read()

Lately, there have been some random errors:

urllib2.URLError: <urlopen error [Errno 111] Connection refused>
<urlopen error [Errno 110] Connection timed out>

The trace of the failure is:

    Traceback (most recent call last):
      File "test.py", line 4, in <module>
        data = urllib2.urlopen(req).read()
      File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen
        return _opener.open(url, data, timeout)
      File "/usr/lib/python2.7/urllib2.py", line 400, in open
        response = self._open(req, data)
      File "/usr/lib/python2.7/urllib2.py", line 418, in _open
        '_open', req)
      File "/usr/lib/python2.7/urllib2.py", line 378, in _call_chain
        result = func(*args)
      File "/usr/lib/python2.7/urllib2.py", line 1215, in https_open
        return self.do_open(httplib.HTTPSConnection, req)
      File "/usr/lib/python2.7/urllib2.py", line 1177, in do_open
        raise URLError(err)
    urllib2.URLError: <urlopen error [Errno 111] Connection refused>

The above errors happen randomly: the script can run successfully the first time but then fail on the second run, and vice versa.

What should I do to debug and figure out where the issue is coming from? How can I tell if the endpoint has consumed my request and returned a response but never reached me?

With telnet

I just tested with telnet. Sometimes it succeeds and sometimes it doesn't, just like my Python script.

On success:

    $ telnet somewhere.com 443
    Trying XXX.YY.ZZZ.WWW...
    Connected to somewhere.com.
    Escape character is '^]'.
    Connection closed by foreign host.

On a refused connection:

    $ telnet somewhere.com 443
    Trying XXX.YY.ZZZ.WWW...
    telnet: Unable to connect to remote host: Connection refused

On a timeout:

    $ telnet somewhere.com 443
    Trying XXX.YY.ZZZ.WWW...
    telnet: Unable to connect to remote host: Connection timed out


asked Aug 27 '12 16:08

Thierry Lam

1 Answer

The problem

The problem is in the network layer. Here are the two error messages explained:

Connection refused: The connection attempt was actively rejected, typically because nothing is listening on the port you are trying to connect to. This usually means that a firewall is denying the connection, that the service is not running on the remote side, or that the service is overloaded.
Connection timed out: No response came from the other side within the time limit while the TCP connection was being established. In the context of urllib this may also mean that the HTTP response did not arrive in time. Common causes are firewalls that silently drop packets, network congestion, and heavy load on the remote (or even local) side.
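
To see the two cases apart from HTTP, a single TCP connection attempt can be classified directly. This is a minimal sketch, assuming Python 3 (the question uses Python 2's urllib2); the function name `probe` and the 5-second default timeout are illustrative choices, not part of the original script.

```python
import socket


def probe(host, port, timeout=5.0):
    """Attempt one TCP connection and classify the outcome."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except ConnectionRefusedError:
        # The attempt was actively rejected (errno 111 on Linux).
        return "refused"
    except (socket.timeout, TimeoutError):
        # No answer to the connection attempt within the time limit.
        return "timed out"
    except OSError as exc:
        # Anything else, e.g. a name resolution failure or unreachable network.
        return "other: %s" % exc
    sock.close()
    return "connected"
```

Calling something like `probe('somewhere.com', 443)` several times in a row should reproduce the same mix of outcomes that telnet shows, without any HTTP or TLS involved.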

In context

That said, the problem is probably not in your script but on the remote side. Because it occurs only occasionally, the likely causes are load problems on the remote host or an unreliable network path to it.

Also, because the problem is in the network, you cannot tell from your side what happened on the other side. It is possible that packets travel fine in one direction but get dropped (or misrouted) in the other.

It is also not a (direct) DNS problem; that would cause a different error (Name or service not known, or something similar). It could, however, be the case that the DNS is configured to return different IP addresses on each request, which would connect you (DNS caching aside) to a different host on each connection attempt. Some of these hosts may in turn be misconfigured or overloaded and thus cause the problems described above.
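
What the client can know is how far the exchange got before it failed. The following is a rough sketch over plain TCP, assuming Python 3 and leaving out the TLS handshake that a real HTTPS request adds; `send_once` and its three return values are hypothetical names chosen for illustration.

```python
import socket


def send_once(host, port, payload, timeout=5.0):
    """Send payload over plain TCP and report how far the exchange got."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Refused or timed out while connecting: the handshake never
        # completed, so no request bytes were sent.
        return "not sent"
    with sock:
        try:
            sock.sendall(payload)
            reply = sock.recv(4096)
        except OSError:
            # Some or all of the request may have been sent, but no reply
            # arrived. The client cannot tell whether it was processed.
            return "unknown"
    # An empty reply means the peer closed the connection without answering.
    return "answered" if reply else "unknown"
```

Both errors in the question's traceback come from the connection phase, which corresponds to the "not sent" case here. Only a failure after the request has gone out leaves it open whether the server acted on it.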
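
One way to check the DNS theory is to resolve every address behind the name and try each one separately. This is a sketch only, assuming Python 3; `resolve_all` and `probe_each` are illustrative helper names, and a resolver may still return a different subset of addresses on a later lookup.

```python
import socket


def resolve_all(host, port):
    """Return the distinct IP addresses that host currently resolves to."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    addresses = []
    for family, socktype, proto, canonname, sockaddr in infos:
        if sockaddr[0] not in addresses:
            addresses.append(sockaddr[0])
    return addresses


def probe_each(host, port, timeout=5.0):
    """Try a TCP connection to every resolved address separately."""
    results = {}
    for address in resolve_all(host, port):
        try:
            socket.create_connection((address, port), timeout=timeout).close()
            results[address] = "connected"
        except OSError as exc:
            # Record the exception class, e.g. ConnectionRefusedError.
            results[address] = type(exc).__name__
    return results
```

If the failures always map to the same one or two addresses, that points to individual hosts behind the name rather than to the network path.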

Debugging this

As suggested in another answer, a packet analyzer can help to debug the issue. However, you won't see much beyond the packets reflecting exactly what the error message says.

To rule out network congestion as a problem you could use a tool like mtr or traceroute or even ping to see if packets get lost to the remote site (see below though).

If network congestion is not a problem (i.e. not more than, say, 1% of the packets get lost), you should contact the remote server administrator to figure out what's wrong. They may be able to see relevant information in the system logs. Running a packet analyzer on the remote side might also be more revealing than running it locally. On the server, checking whether the service is listening on the port with netstat -tlp is definitely recommended.

Interpreting traceroute results

This takes some practice, because high latency or loss at an intermediate hop may mean everything or nothing.

Intermediate hops are typically big routers on the internet or in the ISP's network, which deal with a lot of packets. They may have better things to do than reply to your traceroute, so when they are very busy they may choose to reply to only 10% of the requests, or not to reply at all. If you do not see loss at your last hop, you are probably fine loss-wise.

However, if you do see loss at the last hop, you cannot be sure that the packet really got lost at the last hop. Any of the intermediate hops may be responsible. Typically, you'll also see loss at earlier hops then, which may indicate the real source.

To add insult to injury, it is possible that the route you see is not the whole route: the real route may be asymmetric, meaning that the path to your destination (which is what you see in traceroute) differs from the path the reply takes (which you cannot see in traceroute due to how it works).

To summarize:

Loss observed in traceroute can only be caused by a hop equal to or before the hop which you see.
Loss at an intermediate hop, without end-to-end loss, may just mean that the hop does not bother to reply.
Forward path (what you see in traceroute) may be unequal to the reverse path; loss and latency may occur at the reverse path.
Partial loss (1%-90%) which starts in the middle of a route (and affects all later hops) typically indicates network congestion. Typically, you won't be able to do anything about it.
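
The rules of thumb above can be written down as a small function. This is a sketch, assuming the per-hop loss percentages have already been read off an mtr or traceroute report by hand; `interpret_loss` and the example numbers are made up for illustration, and the function cannot account for loss on the reverse path.

```python
def interpret_loss(loss_per_hop):
    """Interpret per-hop loss percentages, listed from first hop to last."""
    if not loss_per_hop:
        return "no data"
    if loss_per_hop[-1] == 0:
        if any(loss > 0 for loss in loss_per_hop[:-1]):
            return "no end-to-end loss; intermediate hops are probably just not replying"
        return "no loss"
    # Walk back from the destination to find where continuous loss begins.
    start = len(loss_per_hop) - 1
    while start > 0 and loss_per_hop[start - 1] > 0:
        start -= 1
    return "loss persists from hop %d onward; the cause is at or before that hop" % (start + 1)


# Hypothetical reports: loss at one middle hop only, then loss that carries through.
print(interpret_loss([0, 60, 0, 0]))
print(interpret_loss([0, 0, 12, 15, 11]))
```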

answered Oct 17 '22 22:10

Jonas Schäfer
