
How can I debug what is causing a connection refused or a connection time out?

I have the following code that has worked for about a year:

    import urllib2

    req = urllib2.Request('https://somewhere.com','<Request></Request>')
    data = urllib2.urlopen(req)
    print data.read()

Lately, there have been some random errors:

  • urllib2.URLError: <urlopen error [Errno 111] Connection refused>
  • <urlopen error [Errno 110] Connection timed out>

The trace of the failure is:

    Traceback (most recent call last):
      File "test.py", line 4, in <module>
        data = urllib2.urlopen(req).read()
      File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen
        return _opener.open(url, data, timeout)
      File "/usr/lib/python2.7/urllib2.py", line 400, in open
        response = self._open(req, data)
      File "/usr/lib/python2.7/urllib2.py", line 418, in _open
        '_open', req)
      File "/usr/lib/python2.7/urllib2.py", line 378, in _call_chain
        result = func(*args)
      File "/usr/lib/python2.7/urllib2.py", line 1215, in https_open
        return self.do_open(httplib.HTTPSConnection, req)
      File "/usr/lib/python2.7/urllib2.py", line 1177, in do_open
        raise URLError(err)
    urllib2.URLError: <urlopen error [Errno 111] Connection refused>

These errors occur randomly: the script can succeed on one run and fail on the next, or the other way round.

What should I do to debug this and figure out where the issue is coming from? How can I tell whether the endpoint consumed my request and returned a response that never reached me?

With telnet

I just tested with telnet: sometimes it succeeds and sometimes it doesn't, just like my Python script.

On success:

    $ telnet somewhere.com 443
    Trying XXX.YY.ZZZ.WWW...
    Connected to somewhere.com.
    Escape character is '^]'.
    Connection closed by foreign host.

On a refused connection:

    $ telnet somewhere.com 443
    Trying XXX.YY.ZZZ.WWW...
    telnet: Unable to connect to remote host: Connection refused

On a timeout:

    $ telnet somewhere.com 443
    Trying XXX.YY.ZZZ.WWW...
    telnet: Unable to connect to remote host: Connection timed out
asked Aug 27 '12 at 16:08 by Thierry Lam



1 Answer

The problem

The problem is in the network layer. Here is what the two error messages mean:

  • Connection refused: The connection attempt was actively rejected. Either the peer is not listening on the network port you're trying to connect to (the service is not started on the other site, or it is overloaded), or a firewall is actively denying the connection.

  • Connection timed out: During the attempt to establish the TCP connection, no response came from the other side within a given time limit. In the context of urllib this may also mean that the HTTP response did not arrive in time. This is sometimes also caused by firewalls, sometimes by network congestion or heavy load on the remote (or even local) site.
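
The two cases can also be told apart in code at the TCP level, before any HTTP is involved. The following is only a minimal sketch in Python 3 (the question uses Python 2's urllib2); the function name probe and the 5-second default timeout are assumptions made for illustration.

```python
import socket


def probe(host, port, timeout=5.0):
    """Attempt one TCP connection and return a short label for the outcome."""
    try:
        conn = socket.create_connection((host, port), timeout=timeout)
    except (socket.timeout, TimeoutError):
        # No answer to the connection attempt within the time limit.
        return "timed out"
    except ConnectionRefusedError:
        # The attempt was actively rejected (closed port or rejecting firewall).
        return "refused"
    except OSError as exc:
        # Anything else, e.g. name resolution failure or unreachable network.
        return "other error: %s" % exc
    conn.close()
    return "connected"
```

Calling probe("somewhere.com", 443) a number of times should reproduce the same mix of outcomes that telnet shows, which confirms that the script itself is not at fault.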

In context

That said, it is probably not a problem in your script, but on the remote site. If it occurs only occasionally, that indicates that the other site has load problems or that the network path to the other site is unreliable.

Also, as it is a problem with the network, you cannot tell what happened on the other side. It is possible that the packets travel fine in the one direction but get dropped (or misrouted) in the other.

It is also not a (direct) DNS problem; that would cause a different error (Name or service not known or something similar). It could however be the case that the DNS is configured to return different IP addresses on each request, which would connect you (DNS caching left aside) to different hosts on each connection attempt. It could in turn be the case that some of these hosts are misconfigured or overloaded and thus cause the aforementioned problems.
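
One way to check the multiple-addresses theory is to resolve the name yourself and try each address separately. This is a rough Python 3 sketch; resolve_addresses and probe_each are hypothetical helper names, and the result only reflects what your local resolver returns at that moment.

```python
import socket


def resolve_addresses(host, port):
    """Return the distinct IP addresses the resolver currently gives for host."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    addresses = []
    for _family, _type, _proto, _canonname, sockaddr in infos:
        ip = sockaddr[0]
        if ip not in addresses:
            addresses.append(ip)
    return addresses


def probe_each(host, port, timeout=5.0):
    """Try every resolved address on its own; map it to 'ok' or the error text."""
    results = {}
    for ip in resolve_addresses(host, port):
        try:
            socket.create_connection((ip, port), timeout=timeout).close()
            results[ip] = "ok"
        except OSError as exc:
            results[ip] = str(exc)
    return results
```

If probe_each("somewhere.com", 443) consistently reports errors for the same address while the others are fine, a single misbehaving host behind the name is a likely suspect.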

Debugging this

As suggested in another answer, using a packet analyzer can help to debug the issue. However, you won't see much beyond packets that reflect exactly what the error message says.

To rule out network congestion as a problem you could use a tool like mtr or traceroute or even ping to see if packets get lost to the remote site (see below though).

If network congestion is not the problem (i.e. no more than, say, 1% of the packets get lost), you should contact the administrator of the remote server to figure out what's wrong. They may be able to find relevant information in the system logs. Running a packet analyzer on the remote site might also be more revealing than running one on the local site. Checking on the remote host whether the port is open, using netstat -tlp, is definitely recommended then.
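
When you contact the administrator, timestamps of the failed attempts make it much easier to find matching entries in the server logs. A small sampler along these lines could collect them; this is a sketch in Python 3, and the name sample_connections as well as the defaults (20 attempts, one second apart) are arbitrary assumptions.

```python
import socket
import time
from collections import Counter


def sample_connections(host, port, attempts=20, pause=1.0, timeout=5.0):
    """Connect repeatedly; return a timestamped log and a tally of outcomes."""
    log = []
    for _ in range(attempts):
        started = time.strftime("%Y-%m-%d %H:%M:%S")
        try:
            socket.create_connection((host, port), timeout=timeout).close()
            outcome = "ok"
        except OSError as exc:
            # Record the exception class, e.g. ConnectionRefusedError.
            outcome = exc.__class__.__name__
        log.append((started, outcome))
        time.sleep(pause)
    tally = Counter(outcome for _, outcome in log)
    return log, tally
```

The tally gives a rough failure rate, and the log entries for the failed attempts are what you would pass on to the remote site.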

Interpreting traceroute results

This takes some practice, because high latency or loss at an intermediate hop may mean everything or nothing.

Intermediate hops are typically big routers in the internet or the ISP's network which deal with a lot of packets. They may have better things to do than replying to your traceroute, so they may choose to reply to only 10% of the requests if they are currently very busy. Or they may choose not to reply at all. If you do not see loss at your last hop, you are probably fine loss-wise.

However, if you do see loss at the last hop, you cannot be sure that the packet really got lost at the last hop. Any of the intermediate hops may be responsible. Typically, you'll also see loss at earlier hops then, which may indicate the real source.

To add insult to injury, it is possible that the route you see is not the whole route: the real route may be asymmetric, meaning that the path to your destination (which is what you see in traceroute) differs from the path the reply takes (which you cannot see in traceroute due to how it works).

To summarize:

  • Loss observed at a hop in traceroute can only be caused by that hop or by a hop before it.
  • Loss at an intermediate hop, without end-to-end loss, may just mean that the hop does not bother to reply.
  • The forward path (what you see in traceroute) may differ from the reverse path; loss and latency may occur on the reverse path.
  • Partial loss (1%-90%) which starts in the middle of a route (and affects all later hops) typically indicates network congestion. Typically, you won't be able to do anything about it.
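
The rules of thumb above can be written down as a small function. This is only an illustrative sketch: interpret_loss is a hypothetical helper, the input is a list of per-hop loss percentages that you have read off mtr or traceroute yourself, and reverse-path loss remains invisible to it.

```python
def interpret_loss(loss_per_hop):
    """Apply the rules of thumb to per-hop loss percentages.

    loss_per_hop lists the loss in percent for each hop in path order;
    the last entry is the destination.
    """
    if not loss_per_hop:
        return "no data"
    if loss_per_hop[-1] == 0:
        if any(loss > 0 for loss in loss_per_hop[:-1]):
            # Loss only at intermediate hops: they probably just don't reply.
            return "no end-to-end loss; intermediate loss can be ignored"
        return "no loss"
    # Walk backwards to find where the loss that reaches the destination starts.
    start = len(loss_per_hop) - 1
    while start > 0 and loss_per_hop[start - 1] > 0:
        start -= 1
    return "end-to-end loss; it persists from hop %d onward" % (start + 1)
```

For example, interpret_loss([0, 60, 0, 0]) reports no end-to-end loss, while interpret_loss([0, 0, 5, 7, 6]) points at hop 3 as the place where the persistent loss starts, so that hop or an earlier one may be responsible.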
answered Oct 17 '22 at 22:10 by Jonas Schäfer
