
Thrift TSimpleServer becomes unresponsive after several successful requests

I have a Thrift API served from a Java application running on Linux. I'm using a .NET client to connect to the API and execute operations.

The first few calls to the service work fine without errors, but then (seemingly at random) a call will "hang." If I force-quit my client and try to reconnect, the service either hangs again, or my client has the following error:

Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host.
   at System.Net.Sockets.NetworkStream.Read(Byte[] buffer, Int32 offset, Int32 size)
   at Thrift.Transport.TStreamTransport.Read(Byte[] buf, Int32 off, Int32 len) 
   (etc.)

When I use JConsole to get a thread dump, the server thread is sitting in accept():

"Thread-1" prio=10 tid=0x00002aaad457a800 nid=0x79c7 runnable [0x00000000434af000]
   java.lang.Thread.State: RUNNABLE
    at java.net.PlainSocketImpl.socketAccept(Native Method)
    at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:408)
    - locked <0x00000005c0fef470> (a java.net.SocksSocketImpl)
    at java.net.ServerSocket.implAccept(ServerSocket.java:462)
    at java.net.ServerSocket.accept(ServerSocket.java:430)
    at org.apache.thrift.transport.TServerSocket.acceptImpl(TServerSocket.java:113)
    at org.apache.thrift.transport.TServerSocket.acceptImpl(TServerSocket.java:35)
    at org.apache.thrift.transport.TServerTransport.accept(TServerTransport.java:31)
    at org.apache.thrift.server.TSimpleServer.serve(TSimpleServer.java:63)

netstat on the server shows connections to the service port in the TIME_WAIT state, which eventually disappear several minutes after I force-quit the client (as would be expected).

The code that sets up the Thrift service is as follows:

        int port = thriftServicePort;
        String host = thriftServiceHost;
        InetAddress adr = InetAddress.getByName(host);
        InetSocketAddress address = new InetSocketAddress(adr, port);
        TServerTransport serverTransport = new TServerSocket(address);
        TServer server = new TSimpleServer(new TServer.Args(serverTransport).processor((org.apache.thrift.TProcessor)processor));

        server.serve();

Note that we're using the TServerSocket constructor that takes an explicit bind address (hostname or IP address plus port). I suspect that I should switch to the constructor that only specifies a port (ultimately binding to InetAddress.anyLocalAddress()). Alternatively, I suppose I could configure the service to bind to the "wildcard" address ("0.0.0.0").

I should mention that the service is not hosted on the open Internet. It is hosted in a private network and I am using SSH tunneling to reach it. Hence, the hostname that the service is bound to does not resolve in my local network (although I can make the initial connection via tunneling). I wonder if this is something similar to the RMI TCP callback problem?

Is there a technical explanation for what's going on (if this is a common issue), or are there additional troubleshooting steps that I can take?

UPDATE

Had the same problem today, but this time jstack shows that the Thrift server is blocking forever reading from the input stream:

"Thread-1" prio=10 tid=0x00002aaad43fc000 nid=0x60b3 runnable [0x0000000041741000]
   java.lang.Thread.State: RUNNABLE
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.read(SocketInputStream.java:129)
        at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127)
        at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
        at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:378)
        at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:297)
        at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:204)
        at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:22)
        at org.apache.thrift.server.TSimpleServer.serve(TSimpleServer.java:70)

So we need to set a "client timeout" in the TServerSocket constructor. But why would that cause the application to also refuse connections when blocking on accept()?
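To illustrate what such a timeout changes, here is a minimal sketch using plain java.net sockets rather than Thrift. It assumes the "client timeout" behaves like an ordinary socket read timeout (SO_TIMEOUT) on each accepted connection; the class and method names (ReadTimeoutDemo, serve, secondClientServed) are made up for this example. An idle client connects first and sends nothing, then a well-behaved client connects and sends one byte.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

class ReadTimeoutDemo {

    /**
     * Single-threaded accept loop: handles one client at a time on the calling thread.
     * Each client is expected to send one byte, which is echoed back.
     * A readTimeoutMillis of 0 means "wait forever".
     */
    static void serve(ServerSocket listener, int readTimeoutMillis) {
        try {
            while (true) {
                try (Socket client = listener.accept()) {
                    client.setSoTimeout(readTimeoutMillis);
                    int b = client.getInputStream().read(); // blocks while the client stays silent
                    if (b >= 0) {
                        client.getOutputStream().write(b);
                        client.getOutputStream().flush();
                    }
                } catch (SocketTimeoutException silentClient) {
                    // Give up on this client and go back to accept().
                }
            }
        } catch (IOException closed) {
            // Listener closed (or connection error): stop serving in this sketch.
        }
    }

    /** Connects an idle client first, then a well-behaved one; reports whether the latter got its echo. */
    static boolean secondClientServed(int readTimeoutMillis, int waitMillis) {
        try (ServerSocket listener = new ServerSocket(0, 50, InetAddress.getLoopbackAddress())) {
            Thread server = new Thread(() -> serve(listener, readTimeoutMillis));
            server.setDaemon(true);
            server.start();

            try (Socket idle = new Socket(listener.getInetAddress(), listener.getLocalPort());
                 Socket good = new Socket(listener.getInetAddress(), listener.getLocalPort())) {
                good.setSoTimeout(waitMillis);
                good.getOutputStream().write(42);
                good.getOutputStream().flush();
                return good.getInputStream().read() == 42;
            } catch (SocketTimeoutException e) {
                return false; // the server never got around to the second client
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("no timeout:     served=" + secondClientServed(0, 500));
        System.out.println("200 ms timeout: served=" + secondClientServed(200, 3000));
    }
}
```

Without a timeout the second client is left waiting for as long as the idle connection stays open; with one, the server drops the silent client and moves on. The second connection itself is still established by the operating system's listen backlog in both cases, so from the client's side it looks like a hang rather than a refused connection.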

noahlz asked Jan 24 '13 23:01



2 Answers

From your stack trace it seems you are using TSimpleServer, whose Javadoc says:

Simple singlethreaded server for testing.

Probably what you want to use is TThreadPoolServer.

Most likely, the single thread of TSimpleServer is blocked waiting for the dead client to respond or time out. Because TSimpleServer is single-threaded, no thread is available to process other requests.
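The difference can be sketched with plain java.net sockets and an ExecutorService. This is not Thrift code, only an analogy for the single-threaded versus thread-pool designs, and every name in it (BlockedAcceptLoopDemo, handle, serve, secondClientServed) is hypothetical. An idle client connects and sends nothing; a second client then connects and sends one byte.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class BlockedAcceptLoopDemo {

    /** Reads one byte from the client and echoes it back; blocks while the client is silent. */
    static void handle(Socket client) {
        try (Socket c = client) {
            int b = c.getInputStream().read();
            if (b >= 0) {
                c.getOutputStream().write(b);
                c.getOutputStream().flush();
            }
        } catch (IOException ignored) {
            // Connection dropped; nothing else to clean up in this sketch.
        }
    }

    /** Accept loop. With a null pool, each client is handled inline on the accepting thread. */
    static void serve(ServerSocket listener, ExecutorService pool) {
        try {
            while (true) {
                Socket client = listener.accept();
                if (pool == null) {
                    handle(client);                     // accept() is not called again until this returns
                } else {
                    pool.execute(() -> handle(client)); // hand off and go straight back to accept()
                }
            }
        } catch (IOException closed) {
            // Listener was closed: stop serving.
        }
    }

    /** Connects an idle client first, then a well-behaved one; reports whether the latter got its echo. */
    static boolean secondClientServed(boolean pooled, int waitMillis) {
        ExecutorService pool = pooled
                ? Executors.newCachedThreadPool(r -> {
                      Thread t = new Thread(r);
                      t.setDaemon(true);
                      return t;
                  })
                : null;
        try (ServerSocket listener = new ServerSocket(0, 50, InetAddress.getLoopbackAddress())) {
            Thread acceptor = new Thread(() -> serve(listener, pool));
            acceptor.setDaemon(true);
            acceptor.start();

            try (Socket idle = new Socket(listener.getInetAddress(), listener.getLocalPort());
                 Socket good = new Socket(listener.getInetAddress(), listener.getLocalPort())) {
                good.setSoTimeout(waitMillis);
                good.getOutputStream().write(42);
                good.getOutputStream().flush();
                return good.getInputStream().read() == 42;
            } catch (SocketTimeoutException e) {
                return false; // no reply: the server is stuck on the idle client
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            if (pool != null) {
                pool.shutdownNow();
            }
        }
    }

    public static void main(String[] args) {
        System.out.println("single thread: served=" + secondClientServed(false, 500));
        System.out.println("thread pool:   served=" + secondClientServed(true, 3000));
    }
}
```

In the single-threaded variant one silent connection is enough to starve every later client. A pool avoids that, although each stuck connection still occupies a worker thread until it closes or times out.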

sbridges answered Sep 24 '22 07:09



I have some suggestions. You mentioned that the first few calls to the server work and then the hangs start. That's a clue. One scenario where this happens is when the client does not send all of its bytes to the server. I am not familiar with TSimpleServer, but I assume it listens on a port, speaks some binary protocol, and expects every client to talk to it in that protocol. Your .NET client talks to this server by sending bytes. If it's not flushing its output buffer correctly, it may not send all the bytes, which leaves the server waiting for the rest.

In Java this could happen on the client side, like this:

BufferedOutputStream stream = new BufferedOutputStream(socket.getOutputStream()); // wrap the socket's output stream
stream.write(content); // write everything that needs to be written
stream.flush(); // without flush(), buffered bytes may never reach the server, which then hangs waiting for them
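A small self-contained check of that claim, using only java.net and java.io on the loopback interface. The class and method names (FlushDemo, bytesReceived) are invented for this sketch, and the four-byte payload is arbitrary. It relies on BufferedOutputStream holding small writes in its internal buffer (8192 bytes by default) until flush() or close() is called.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;
import java.nio.charset.StandardCharsets;

class FlushDemo {

    /** Returns how many bytes the server side received within waitMillis (0 if none arrived). */
    static int bytesReceived(boolean flush, int waitMillis) {
        byte[] content = "ping".getBytes(StandardCharsets.US_ASCII);
        try (ServerSocket listener = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(listener.getInetAddress(), listener.getLocalPort());
             Socket serverSide = listener.accept()) {

            BufferedOutputStream out = new BufferedOutputStream(client.getOutputStream());
            out.write(content); // small write: stays in the buffer, nothing is on the wire yet
            if (flush) {
                out.flush();    // pushes the buffered bytes to the socket
            }

            serverSide.setSoTimeout(waitMillis);
            try {
                return serverSide.getInputStream().read(new byte[content.length]);
            } catch (SocketTimeoutException e) {
                return 0;       // the server saw nothing before the timeout
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("without flush: " + bytesReceived(false, 300) + " bytes");
        System.out.println("with flush:    " + bytesReceived(true, 3000) + " bytes");
    }
}
```

Without the timeout on the server side, the unflushed case would block in read() indefinitely, which is the kind of hang described above.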

Suggestions:

a) Go through your .NET client code and check that every part that actually communicates with the server calls the equivalent of flush() or the cleanup methods. Note: the Thrift documentation says that the transport layer defines a flush() method, so scan your .NET code to see whether it uses the transport methods: http://thrift.apache.org/docs/concepts/

b) For further debugging, you could write a small Java client that simulates your .NET client. Run it on your Linux machine (the same machine where TSimpleServer runs) and see whether it causes the same issue. If it does, you can debug the Java client and find the root cause. If it doesn't, run it where your .NET client runs, see whether any issues appear, and take it from there.

Edit: c) I found sample Thrift client code in Java here: https://chamibuddhika.wordpress.com/2011/10/02/apache-thrift-quickstart-tutorial/ It calls transport.open(), does its work, and then calls transport.close(). As suggested in a), you could go through your .NET client code and check that it calls the transport methods flush() and close() on completion.

Zenil answered Sep 24 '22 07:09
