Tracking down database connection issues

Background

We have a number of web applications on different web servers that connect to a single database server. Over the past couple of months, we have noticed that every once in a while our web servers are unable to connect to the database server.

Our Environment

We have a couple of different web environments, some running ColdFusion and others running .NET. The .NET apps are both Web Forms and MVC, and they span multiple versions from 2.0 to 4.5. Both the ColdFusion and .NET web servers are Windows-based machines. Both web environments are clustered; some of the machines are physical while others are virtual.

Our database server is SQL Server 2008 R2. It houses multiple databases. Each application connects with its own database user, which has access only to that application's database.

Other Facts

  • When we notice issues, they occur in short bursts that last anywhere from a couple seconds to a couple minutes.
  • When we notice issues, the burst contains errors from multiple different applications, not just one app at a time.
  • When we notice issues, the burst contains errors from applications from different web environments. (This makes us think we can rule out that the apps themselves are the issue)
  • The bursts of connection issues happen at various times throughout the day and night. They are not always during times of high usage.
  • We have monitored things like number of user connections, memory, IO, CPU usage, etc., and we have not seen spikes or anything else that might point to a problem.
  • We installed Wireshark on the web and database servers hoping to catch the problem, but without success.

Questions

  1. Does anyone have suggestions on where I should look next?
  2. Are there properties of the database that could cause this?
  3. Is there any way to "monitor" the connection between the database and web server in a better manner?
  4. Is there anything that can be done on the app side to better understand what is happening?

Errors Caught by Apps

  • .NET errors
    • A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: Named Pipes Provider, error: 40 - Could not open a connection to SQL Server)
    • Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding.
    • A transport-level error has occurred when receiving results from the server. (provider: TCP Provider, error: 0 - The semaphore timeout period has expired.)
    • Timeout expired. The timeout period elapsed prior to obtaining a connection from the pool. This may have occurred because all pooled connections were in use and max pool size was reached.
  • ColdFusion errors
    • Error Executing Database Query. The TCP/IP connection to the host has failed. java.net.ConnectException: Connection timed out: connect
      The error occurred on line 38.
    • Error Executing Database Query. Connection reset by peer: socket write error
      The error occurred on line 91.
    • Error Executing Database Query. Timed out trying to establish connection
      The error occurred on line 38.
Jason asked Oct 24 '12 14:10


1 Answer

In CF, I once had an issue like the one you are seeing. I had CF on one server and SQL Server 2008 R2 on another, and I would see CF errors like the ones you posted. To help trace it to a network problem, I set up something like this:

1) I created a down.bat containing:

tracert serverip

2) I then put a <cftry><cfcatch> around the query.

When the query threw an error, the <cfcatch> block would execute:

<!--- Inside <cfcatch>: run the traceroute, then mail its output with the error details --->
<cfexecute name="C:\path\to\down.bat" variable="log" timeout="60" />
<cfmail to="ME" from="Server" subject="SQL DOWN">

Server Debugging Info:
------------------------------------------------------------
#now()#

#cfcatch.Detail#

#cfcatch.Message#

#log#

</cfmail>

This helped me track down the cause, which turned out to be hardware at the datacenter.
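
Since the question covers both CF and .NET servers, the same idea can also be run as a standalone probe outside either app. What follows is only a rough sketch, assuming Python 3 is available on a web server: it tries a plain TCP connect to the database port and, when that fails, appends the error and a tracert dump to a log file. The function names, the log path, and the use of Python are all assumptions for illustration and are not part of the setup described above.

```python
import socket
import subprocess
import time
from datetime import datetime


def probe(host, port, timeout=5.0):
    """Try one TCP connect; return (ok, seconds_elapsed, error_text)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start, ""
    except OSError as exc:
        # Covers refused connections, timeouts, and name resolution failures.
        return False, time.monotonic() - start, str(exc)


def trace_route(host):
    """Run tracert (the same command as down.bat) and return its output."""
    try:
        result = subprocess.run(
            ["tracert", "-d", host],  # -d skips reverse DNS lookups
            capture_output=True,
            text=True,
            timeout=120,
        )
        return result.stdout
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"tracert failed: {exc}"


def check_once(host, port, log_path, tracer=trace_route):
    """Probe once; on failure, append the error and a route trace to log_path."""
    ok, elapsed, error = probe(host, port)
    if not ok:
        with open(log_path, "a", encoding="utf-8") as log:
            stamp = datetime.now().isoformat()
            log.write(f"{stamp} connect to {host}:{port} failed after {elapsed:.2f}s: {error}\n")
            log.write(tracer(host) + "\n")
    return ok
```

You would call check_once(host, 1433, "db_probe.log") every few seconds from a scheduled task on each web server, where host is your database server and 1433 is only the default SQL Server port (adjust if your instance listens elsewhere). Because the bursts last from seconds to minutes, a short polling interval matters more than anything else here. A failed connect from a probe that shares no code with the apps would point further toward the network rather than the applications.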

steve answered Sep 27 '22 18:09