
select() system call hangs indefinitely in a networking application

We have a networking application that is used inside various scripts to communicate with other systems.

Occasionally the scripts hang on a call to our networking application. We recently experienced a hang, and I tried to debug the hung process of this particular application.

This application consists of a client and a server (a daemon); the hang occurs on the client side.

Strace output showed me that it's hung on a select system call.

> strace -p 34567
select(4, [3], NULL, NULL, NULL

As you can see, no timeout is given on the select() call, so it can block indefinitely if file descriptor '3' never becomes ready for reading.
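To illustrate what the timeout argument changes, here is a minimal Python sketch. It is not the application's code: a local `socket.socketpair()` stands in for the client/daemon connection, and the 0.2 second timeout is an arbitrary value chosen for the demo.

```python
import select
import socket

# A connected pair of local sockets stands in for the client/daemon link
# (an assumption for illustration; the real application uses TCP).
client, daemon = socket.socketpair()

# Nothing has been sent yet. With a 0.2 s timeout, select() gives up and
# returns empty lists instead of blocking forever as it would with no timeout.
readable, _, _ = select.select([client], [], [], 0.2)
timed_out = not readable

# Once the peer writes something, the same call reports the socket as readable.
daemon.sendall(b"result")
readable, _, _ = select.select([client], [], [], 0.2)
got_data = client in readable

client.close()
daemon.close()
```

Passing no timeout (the `NULL` in the strace line above) removes the first outcome entirely: the call can only return once the descriptor becomes readable.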

lsof output showed that fd '3' is in FIN_WAIT2 state.

> lsof -p 34567
client  34567 user 3u  IPv4 55184032 TCP client-box:smar-se-port2->server:daemon (FIN_WAIT2)

Does the above information imply anything, in particular the FIN_WAIT2 state? I checked on the server side (where the corresponding daemon process should be running), but there are no daemon processes running there. My guess is that the daemon ran successfully and sent its output to the client, which should then be available on fd '3' for reading, but the select() call on the client never returns and keeps waiting for something to happen!

I am not sure why it never returns from the select() call. This only happens occasionally; most of the time the application works fine.

Any clues?

Both server and client run SuSE Linux.

asked Feb 02 '14 15:02 by ernesto


1 Answer

FIN_WAIT2 means your app has sent a FIN packet to the peer and the peer has acknowledged it, but your side has not received a FIN from the peer yet. In TCP, a graceful close requires a FIN from both parties. The fact that the server daemon is not running means the daemon exited (or was killed) without notifying its peer (you). So your select() is waiting for packets it will no longer receive, and has to wait for the OS to invalidate the socket using an internal timeout, which can take a very long time (and, without TCP keepalive, may not happen at all while your app still holds the socket open). This is exactly the kind of situation that shows why you should never use infinite timeouts. Use an appropriate timeout and act accordingly if the timeout elapses.
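As a rough sketch of that advice, here it is in Python rather than in your application's own language. `read_reply` is a hypothetical helper, the timeout values are arbitrary, and a local `socket.socketpair()` stands in for the real TCP connection. It waits for the reply with a bounded select() and treats an empty read as the peer's close:

```python
import select
import socket


def read_reply(sock, timeout_s):
    """Collect data until the peer closes, tolerating at most timeout_s of silence.

    Hypothetical helper: raises TimeoutError if the peer goes quiet.
    """
    chunks = []
    while True:
        readable, _, _ = select.select([sock], [], [], timeout_s)
        if not readable:
            raise TimeoutError("peer sent nothing for %.1f s" % timeout_s)
        data = sock.recv(4096)
        if not data:  # empty read: the peer closed its side, reply is complete
            return b"".join(chunks)
        chunks.append(data)


# Case 1: the daemon answers and closes cleanly.
client, daemon = socket.socketpair()
client.sendall(b"request")
client.shutdown(socket.SHUT_WR)  # half-close: we are done sending
daemon.recv(4096)                # daemon consumes the request
daemon.sendall(b"output")
daemon.close()
reply = read_reply(client, 0.5)
client.close()

# Case 2: the peer never answers and never closes.
client2, silent_peer = socket.socketpair()
try:
    read_reply(client2, 0.2)
    gave_up = False
except TimeoutError:
    gave_up = True  # the caller can now log, retry, or exit with an error
client2.close()
silent_peer.close()
```

In the second case the caller regains control after the timeout instead of hanging, which is the point at which a script can retry or fail loudly.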

answered Oct 09 '22 04:10 by Remy Lebeau