
How to find the cause of bad TCP connections

We're developing an online game where players communicate with the server using a persistent TCP connection. Persistent as in, its lifetime is that of a player's session, and if the connection is closed, the player is thrown from the game (though the client will attempt to automatically reconnect).

Problem

Now, of course everything works fine in our office (connecting to both testing and live servers), but our client reports that some players get disconnected a lot (every few seconds), and that they experience it themselves too (though their offices are in the same building).

Question

How can I find out the cause of these disconnects? Is it because:

  • Players have bad internet connections and it can't be helped.
  • The distance between players and server (Turkey <-> Netherlands) is too long.
  • Something is wrong with the server (a CentOS machine) or the datacenter.
  • The server is overloaded (though it happens under low loads too).
  • There is an error in our software.
  • Or some other reason?

The software is written in Java. It logs when players are disconnected, and if it actively kicks them (e.g. for not sending keep-alive messages) it logs that too.

Known data

  • Whenever a spurious disconnect is reported and I check the logs, most of the time I don't see that player being actively kicked by the server software. I only see that the connection has been closed.
  • There is an internal monitoring service which has a bunch of localhost connections to the game server, the same way players do, and it doesn't get disconnected.

Others

There are many other online games like ours. How do they deal with this? (Unless the problem is in the server/datacenter, then the solution is obvious)

  • Do they use UDP? I know action games do, for speed, but I presume TCP is normal for e.g. online poker and other slow games? (Not that that would help us, our client software is made in Flash, which doesn't support UDP)
  • Is there some TCP tweaking that can be done to make it more lenient?
  • Or do they get these disconnects as well, and just reconnect more transparently?
  • Is there information about this on the web?
Bart van Heukelom asked Nov 14 '22 10:11


1 Answer

I would ask players to allow you to enable "anonymous usage data", like many apps do, to periodically upload debugging information from their sessions back to you. This is how you figure out these sorts of situations.

Next, you need a fairly verbose log of each disconnect. When one happens, catch whatever exception was thrown and log its cause as well, calling .getCause() repeatedly until you have logged all the way back to the root cause. Also log whatever data you need to match the client log against the server-side logs: session IDs, game IDs, timestamps, and so on. Ask yourself, "What information would I need in order to troubleshoot this, assuming I had insight into both sides of the connection?" That is what asking users to upload usage and debugging data ultimately gives you.
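As a rough illustration of walking the cause chain, here is a minimal Java sketch. The class name DisconnectReport, the method describe, and the log line layout are all made up for this example and are not from any library. It assumes you already have a session ID and a timestamp at the point where the exception is caught.

```java
import java.time.Instant;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

// Hypothetical helper: the names and the log format are assumptions for illustration.
final class DisconnectReport {
    private DisconnectReport() {}

    /** Builds one log entry: correlation fields first, then every exception in the cause chain. */
    static String describe(String sessionId, Instant when, Throwable error) {
        StringBuilder sb = new StringBuilder();
        sb.append("disconnect session=").append(sessionId)
          .append(" at=").append(when);
        // Identity-based set guards against a (rare) cyclic cause chain.
        Set<Throwable> seen =
                Collections.newSetFromMap(new IdentityHashMap<Throwable, Boolean>());
        int depth = 0;
        // Follow getCause() until there is no further cause, i.e. the root cause is logged.
        for (Throwable t = error; t != null && seen.add(t); t = t.getCause()) {
            sb.append(System.lineSeparator())
              .append("  cause[").append(depth++).append("] ")
              .append(t.getClass().getName()).append(": ").append(t.getMessage());
        }
        return sb.toString();
    }
}
```

You would call something like DisconnectReport.describe(sessionId, Instant.now(), e) inside the catch block and hand the result to whatever logger you already use. Writing the same session ID on the client and the server is what lets you line the two logs up afterwards.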

From there you should be able to identify at least a few situations that you control, that is, where you can change your client or server code to alleviate some of the problems. In other cases, where the problem is a client's configuration or faulty equipment (or a piece of equipment in between that neither of you controls), you'll have to rely on robust reconnection.
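One common way to make reconnection more robust is to retry with an exponentially growing delay, so a flaky link is not hammered with attempts. The sketch below shows only the retry logic, written in Java for consistency with the server. The real client in the question is Flash, so this would need porting. Reconnector, backoffMillis, and connectWithRetry are hypothetical names, and the connect callback stands in for whatever actually opens the socket.

```java
import java.util.concurrent.Callable;

// Hypothetical helper: names and parameters are assumptions for illustration.
final class Reconnector {
    private Reconnector() {}

    /** Delay before retrying after the given 0-based attempt: base * 2^attempt, capped at maxMillis. */
    static long backoffMillis(int attempt, long baseMillis, long maxMillis) {
        // Cap the shift so that, for base delays of a few seconds at most,
        // the left shift cannot overflow a long.
        int shift = Math.min(attempt, 20);
        return Math.min(maxMillis, baseMillis << shift);
    }

    /**
     * Calls connect until it succeeds or maxAttempts have been made.
     * Returns the connection, or rethrows the last failure.
     */
    static <T> T connectWithRetry(Callable<T> connect, int maxAttempts,
                                  long baseMillis, long maxMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return connect.call();
            } catch (Exception e) {
                last = e;
                // Sleep only if another attempt will follow.
                if (attempt + 1 < maxAttempts) {
                    Thread.sleep(backoffMillis(attempt, baseMillis, maxMillis));
                }
            }
        }
        throw last;
    }
}
```

With a base of 500 ms and a cap of 10 s, the waits would be 500 ms, 1 s, 2 s, 4 s, 8 s, and then 10 s for every later attempt. Whether the session can be resumed after reconnecting depends on the server keeping the player's state for a while, which this sketch does not cover.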

You'll never reduce disconnects to zero. Once you have seen enough cases, though, this information should help you eliminate the disconnects you can fix, leaving only those caused by things outside your control. At that point your power to shape the network ends, and you'll be as close to a "best case scenario" for network reliability as you can be.

jefflunt answered Nov 16 '22 02:11