[See updates at bottom]
I'm using JDBC to run statements against SQL Server 2008 R2 on a Windows 2008 R2 machine from a machine running Ubuntu 10.04 LTS with the 2.6.32-32-server kernel. I'm using the current Sun Java 6 build for Ubuntu (sun-java6-jdk 6.24-1build0.10.04.1) and MS's current JDBC 3.0 driver (sqljdbc_3.0.1301.101_enu).
When a statement takes longer than 40 seconds to complete and it doesn't return a ResultSet (e.g. 'stmt.executeUpdate("SELECT * INTO BAR FROM FOO")'), the program terminates with a connection reset:
Exception in thread "main" com.microsoft.sqlserver.jdbc.SQLServerException: Connection reset
at com.microsoft.sqlserver.jdbc.SQLServerConnection.terminate(SQLServerConnection.java:1352)
at com.microsoft.sqlserver.jdbc.SQLServerConnection.terminate(SQLServerConnection.java:1339)
at com.microsoft.sqlserver.jdbc.TDSChannel.read(IOBuffer.java:1654)
at com.microsoft.sqlserver.jdbc.TDSReader.readPacket(IOBuffer.java:3694)
at com.microsoft.sqlserver.jdbc.TDSCommand.startResponse(IOBuffer.java:5022)
at com.microsoft.sqlserver.jdbc.SQLServerStatement.doExecuteStatement(SQLServerStatement.java:773)
at com.microsoft.sqlserver.jdbc.SQLServerStatement$StmtExecCmd.doExecute(SQLServerStatement.java:676)
at com.microsoft.sqlserver.jdbc.TDSCommand.execute(IOBuffer.java:4575)
at com.microsoft.sqlserver.jdbc.SQLServerConnection.executeCommand(SQLServerConnection.java:1400)
at com.microsoft.sqlserver.jdbc.SQLServerStatement.executeCommand(SQLServerStatement.java:179)
at com.microsoft.sqlserver.jdbc.SQLServerStatement.executeStatement(SQLServerStatement.java:154)
at com.microsoft.sqlserver.jdbc.SQLServerStatement.executeUpdate(SQLServerStatement.java:633)
at TestTimeout.main(TestTimeout.java:42)
If my statement does return a ResultSet (e.g. 'ResultSet res = stmt.executeQuery("SELECT * FROM FOO")'), the connection doesn't time out.
When I run the same statement not returning a ResultSet against a copy of the database in SQL2005 on Win2003R2, the statement completes without a connection reset at 40 seconds.
I enabled logging and compared the logs for the SQL2005 statement that finishes with the SQL2008R2 statement that doesn't finish, and they are line-for-line equivalent up to the connection reset message in the 2008 query; see the line at 12:54:47 PM:
Jun 6, 2011 12:54:07 PM com.microsoft.sqlserver.jdbc.TDSCommand onRequestComplete
FINEST: TDSCommand@7ac2b2f6 (SQLServerStatement:1 executeXXX): request complete
Jun 6, 2011 12:54:07 PM com.microsoft.sqlserver.jdbc.TDSCommand startResponse
FINEST: TDSCommand@7ac2b2f6 (SQLServerStatement:1 executeXXX): Reading response...
Jun 6, 2011 12:54:47 PM com.microsoft.sqlserver.jdbc.TDSChannel read
FINE: TDSChannel (ConnectionID:1) read failed:Connection reset
Jun 6, 2011 12:54:47 PM com.microsoft.sqlserver.jdbc.SQLServerException logException
FINE: *** SQLException:ConnectionID:1 com.microsoft.sqlserver.jdbc.SQLServerException: Connection reset Connection reset
Jun 6, 2011 12:54:47 PM com.microsoft.sqlserver.jdbc.SQLServerException logException
FINE: com.microsoft.sqlserver.jdbc.SQLServerConnection.terminate(SQLServerConnection.java:1352)com.microsoft.sqlserver.jdbc.SQLServerConnection.terminate(SQLServerConnection.java:1339)com.microsoft.sqlserver.jdbc.TDSChannel.read(IOBuffer.java:1654)com.microsoft.sqlserver.jdbc.TDSReader.readPacket(IOBuffer.java:3694)com.microsoft.sqlserver.jdbc.TDSCommand.startResponse(IOBuffer.java:5022)com.microsoft.sqlserver.jdbc.SQLServerStatement.doExecuteStatement(SQLServerStatement.java:773)com.microsoft.sqlserver.jdbc.SQLServerStatement$StmtExecCmd.doExecute(SQLServerStatement.java:676)com.microsoft.sqlserver.jdbc.TDSCommand.execute(IOBuffer.java:4575)com.microsoft.sqlserver.jdbc.SQLServerConnection.executeCommand(SQLServerConnection.java:1400)com.microsoft.sqlserver.jdbc.SQLServerStatement.executeCommand(SQLServerStatement.java:179)com.microsoft.sqlserver.jdbc.SQLServerStatement.executeStatement(SQLServerStatement.java:154)com.microsoft.sqlserver.jdbc.SQLServerStatement.executeUpdate(SQLServerStatement.java:633)TestTimeout.main(TestTimeout.java:42)
[...]
Here are corresponding lines from the statement against the 2005 database that works:
Jun 6, 2011 2:02:20 PM com.microsoft.sqlserver.jdbc.TDSCommand onRequestComplete
FINEST: TDSCommand@4737371 (SQLServerStatement:1 executeXXX): request complete
Jun 6, 2011 2:02:20 PM com.microsoft.sqlserver.jdbc.TDSCommand startResponse
FINEST: TDSCommand@4737371 (SQLServerStatement:1 executeXXX): Reading response...
Jun 6, 2011 2:02:57 PM com.microsoft.sqlserver.jdbc.TDSChannel logPacket
FINEST: /XXX.XXX.XXX.XXX:60091 SPID:73 TDSReader@6 (ConnectionID:1) received Packet:1 (13 bytes)
XX XX XX XX XX XX XX XX XX XX XX XX XX XX XX XX .....I..........
XX XX XX XX XX .....
Jun 6, 2011 2:02:57 PM com.microsoft.sqlserver.jdbc.TDSCommand onResponseEOM
FINEST: TDSCommand@4737371 (SQLServerStatement:1 executeXXX): disabling interrupts
Jun 6, 2011 2:02:57 PM com.microsoft.sqlserver.jdbc.TDSReader nextPacket
FINEST: TDSReader@6 (ConnectionID:1) Moving to next packet -- unlinking consumed packet
Jun 6, 2011 2:02:57 PM com.microsoft.sqlserver.jdbc.TDSParser parse
FINEST: TDSReader@6 (ConnectionID:1): getNextResult: Processing TDS_DONE (0xFD)
[...]
I used tcpdump to capture all traffic between the SQL Server host and the Linux host as well as all ICMP traffic, and I notice that the 2008 and 2005 servers both send the Linux a TCP keep-alive packet 30 seconds after the statement starts executing; the Linux host acknowledges the keep-alive from the 2005 server with an ACK, but in connections to the 2008 server, the Linux host sends no ACK, and the 2008 server retransmits the keep-alive 9 times (once per second) before resetting the connection (hence the 40-second period until timeout). Now I noticed that there's a difference between the keep-alive packets transmitted by the Win2003/SQL2005 and Win2008R2/SQL2008R2 hosts: the newer OS uses TCP window scaling with a window size of 66560. So now I wonder if the TCP window size > 65535 is causing iptables or the tcp/ip stack on the Linux machine to ignore the packet silently. But then other packets earlier in the connection also have a scaled window size of 66560, and they are acknowledged by the Linux server. There is nothing in the log files to indicate that these packets are being dropped or are causing any sort of problem.
One final note: in the course of chasing this problem down we've had to reboot the Linux server a couple of times because of updates, and both times the connections ran without timing out for one or two days.
So I'm puzzled, and I'm hoping one of you might have a clue for me.
Update
I've discovered that I can eliminate the connection timeout by disabling tcp timestamps on the Linux server. Disabling window scaling has no effect on the problem. Pursuing the implications of disabling tcp timestamps seems more a question for serverfault.com, so I'll see about migrating this question over there.
Update 2
Comparing the packet traces for the connection that works (Win2003/SQL2003) with the one that doesn't (Win2008R2/SQL2008R2), I notice that the keepalive for the Win2003 connection have no options (even though it uses tcp timestamps in earlier packets), and the keepalive for the connection that is broken (unless timestamps are disabled) does have tcp options in the keepalive, namely the timestamps. So now it looks like the Ubuntu machine responds to keepalives without tcp options but ignores keeplives with tcp options. This is really a question about tcp/ip issues on the two hosts.
Final Update I pursued this question on the Linux networking dev list, and I'm now persuaded that the problem is due to a Windows bug that causes bad checksums to be generated for tcp keepalives that have tcp timestamps (but, evidently, for no other packets). See the thread on the netdev list. This question should be closed.
It turns out that the tcp keepalives with tcp timestamps sent by Win2008 had incorrect tcp checksums, which led the Linux host to correctly ignore them. This problem is almost certainly a Windows bug rather than a programming or Linux kernel problem. See this thread on the Linux networking dev list.
I saw a similar problem with Windows Server 2003 and SQL server having to do with Scalable Networking Pack (SNP) on a computer that has a TCP/IP Offload-enabled network adapter.
Info and solution here:
http://support.microsoft.com/kb/948496
We've encountered this same problem at my current employer. We found out just by chance that if the Windows 2008 R2 server has "Chimney Offload State" set to Enabled, then this same disconnect happens after 40 seconds. If that setting is set to Automatic, then it worked just fine.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With