Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Sql Server Transient Exception Numbers

Tags:

c#

sql-server

I want to write some wrapper code to my database calls (using C#, and a Microsoft technology for accessing the database), auto-retrying on a 'transient' exception. By transient, I mean something that there is a good chance will resolve eventually (as against logically errors that will never work). Examples that I can think of include:

  • Deadlock
  • Connection Timeout
  • Command timeout

I had planned on using SqlException's error numbers to spot these. So for example:

List<RunStoredProcedureResultType> resultSet = null;
int limit = 3;
for (int i = 0; i < limit; ++i)
{
    bool isLast = i == limit - 1;
    try
    {
        using (var db = /* ... */)
        {
            resultSet = db.RunStoredProcedure(param1, param2).ToList();
        }
        //if it gets here it was successful
        break;
    }
    catch (SqlException ex)
    {
        if (isLast)
        {
            //3 transient errors in a row. So just kill it
            throw;
        }
        switch (ex.Number)
        {
            case 1205: //deadlock
            case -2:   //timeout (command timeout?)
            case 11:   //timeout (connection timeout?)
                // do nothing - continue the loop
                break;
            default:
                //a non-transient error. Just throw the exception on
                throw;
        }
    }
    Thread.Sleep(TimeSpan.FromSeconds(1)); //some kind of delay - might not use Sleep
}
return resultSet;

(excuse me for any bugs - I've just written that on the fly. I also realise I could wrap it up nicely...)

So the key question is: what numbers should I consider 'transient' (I realise what I consider transient may be different to what other people consider transient). I've found a nice list here:

https://msdn.microsoft.com/en-us/library/cc645603.aspx

but its massive and note very useful. Has anyone else built up a list that they use for something similar?

UPDATE

In the end, we opted for a 'bad list' - if the error is a one of a list of known 'non-transient errors' - which are generally programmer errors. I've included a list of numbers we are using as an answer.

like image 387
thab Avatar asked Feb 25 '16 14:02

thab


People also ask

What is SQL transient error?

A transient error, also known as a transient fault, has an underlying cause that soon resolves itself. An occasional cause of transient errors is when the Azure system quickly shifts hardware resources to better load-balance various workloads. Most of these reconfiguration events finish in less than 60 seconds.

What information does @@ error hold in SQL?

Using @@ERROR with @@ROWCOUNT. The following example uses @@ERROR with @@ROWCOUNT to validate the operation of an UPDATE statement. The value of @@ERROR is checked for any indication of an error, and @@ROWCOUNT is used to ensure that the update was successfully applied to a row in the table.

What is transient error?

Transient errors are errors which are recoverable: a connection was temporarily not available, timeouts etc.. Some strategies check for transient errors, and if such an error occurs, they will retry, and otherwise fail. Other strategies will always retry, no matter what the error is.

What is SqlAzureExecutionStrategy?

The SqlAzureExecutionStrategy will retry instantly the first time a transient failure occurs, but will delay longer between each retry until either the max retry limit is exceeded or the total time hits the max delay.


2 Answers

There's a class [SqlDatabaseTransientErrorDetectionStrategy.cs] in sql Azure for transient fault handling. It covers almost all types of exception code that can be considered as transient. Also it's a complete implementation of a Retry strategy.

Adding the snippet here for future reference:

/// <summary>
/// Error codes reported by the DBNETLIB module.
/// </summary>
private enum ProcessNetLibErrorCode
{
    ZeroBytes = -3,

    Timeout = -2,
    /* Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding. */

    Unknown = -1,

    InsufficientMemory = 1,

    AccessDenied = 2,

    ConnectionBusy = 3,

    ConnectionBroken = 4,

    ConnectionLimit = 5,

    ServerNotFound = 6,

    NetworkNotFound = 7,

    InsufficientResources = 8,

    NetworkBusy = 9,

    NetworkAccessDenied = 10,

    GeneralError = 11,

    IncorrectMode = 12,

    NameNotFound = 13,

    InvalidConnection = 14,

    ReadWriteError = 15,

    TooManyHandles = 16,

    ServerError = 17,

    SSLError = 18,

    EncryptionError = 19,

    EncryptionNotSupported = 20
}

Further a switch case to check if the error number returned in sql exception:

switch (err.Number)
{
    // SQL Error Code: 40501
    // The service is currently busy. Retry the request after 10 seconds. Code: (reason code to be decoded).
    case ThrottlingCondition.ThrottlingErrorNumber:
        // Decode the reason code from the error message to determine the grounds for throttling.
        var condition = ThrottlingCondition.FromError(err);

        // Attach the decoded values as additional attributes to the original SQL exception.
        sqlException.Data[condition.ThrottlingMode.GetType().Name] =
            condition.ThrottlingMode.ToString();
        sqlException.Data[condition.GetType().Name] = condition;

        return true;

    // SQL Error Code: 10928
    // Resource ID: %d. The %s limit for the database is %d and has been reached.
    case 10928:
    // SQL Error Code: 10929
    // Resource ID: %d. The %s minimum guarantee is %d, maximum limit is %d and the current usage for the database is %d. 
    // However, the server is currently too busy to support requests greater than %d for this database.
    case 10929:
    // SQL Error Code: 10053
    // A transport-level error has occurred when receiving results from the server.
    // An established connection was aborted by the software in your host machine.
    case 10053:
    // SQL Error Code: 10054
    // A transport-level error has occurred when sending the request to the server. 
    // (provider: TCP Provider, error: 0 - An existing connection was forcibly closed by the remote host.)
    case 10054:
    // SQL Error Code: 10060
    // A network-related or instance-specific error occurred while establishing a connection to SQL Server. 
    // The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server 
    // is configured to allow remote connections. (provider: TCP Provider, error: 0 - A connection attempt failed 
    // because the connected party did not properly respond after a period of time, or established connection failed 
    // because connected host has failed to respond.)"}
    case 10060:
    // SQL Error Code: 40197
    // The service has encountered an error processing your request. Please try again.
    case 40197:
    // SQL Error Code: 40540
    // The service has encountered an error processing your request. Please try again.
    case 40540:
    // SQL Error Code: 40613
    // Database XXXX on server YYYY is not currently available. Please retry the connection later. If the problem persists, contact customer 
    // support, and provide them the session tracing ID of ZZZZZ.
    case 40613:
    // SQL Error Code: 40143
    // The service has encountered an error processing your request. Please try again.
    case 40143:
    // SQL Error Code: 233
    // The client was unable to establish a connection because of an error during connection initialization process before login. 
    // Possible causes include the following: the client tried to connect to an unsupported version of SQL Server; the server was too busy 
    // to accept new connections; or there was a resource limitation (insufficient memory or maximum allowed connections) on the server. 
    // (provider: TCP Provider, error: 0 - An existing connection was forcibly closed by the remote host.)
    case 233:
    // SQL Error Code: 64
    // A connection was successfully established with the server, but then an error occurred during the login process. 
    // (provider: TCP Provider, error: 0 - The specified network name is no longer available.) 
    case 64:
    // DBNETLIB Error Code: 20
    // The instance of SQL Server you attempted to connect to does not support encryption.
    case (int)ProcessNetLibErrorCode.EncryptionNotSupported:
        return true;
}

See complete source here.

like image 149
vendettamit Avatar answered Sep 19 '22 15:09

vendettamit


Sorry to answer my own question, but if anyone is still interested, we've just started building up our own list of error codes. Not ideal, but we figured this shouldn't happen too often.

We opted for a 'bad list' approach, instead of the 'good list' as implied in the question. The ids we have so far are:

PARAMETER_NOT_SUPPLIED = 201;
CANNOT_INSERT_NULL_INTO_NON_NULL = 515;
FOREGIN_KEY_VIOLATION = 547;
PRIMARY_KEY_VIOLATION = 2627;
MEMORY_ALLOCATION_FAILED = 4846;
ERROR_CONVERTING_NUMERIC_TO_DECIMAL = 8114; 
TOO_MANY_ARGUMENTS = 8144;
ARGUMENT_IS_NOT_A_PARAMETER = 8145;
ARGS_SUPPLIED_FOR_PROCEDURE_WITHOUT_PARAMETERS = 8146;
STRING_OR_BINARY_TRUNCATED = 8152;
INVALID_POINTER = 10006;
WRONG_NUMBER_OF_PARAMETERS = 18751;

Another thing we noticed is that if the connection pool times out you don't get a SqlException - instead you get a InvalidOperationException reporting "Timeout expired". It's a shame its not a SqlException but well worth catching.

I'll try to keep this up to date with any additions.

like image 23
thab Avatar answered Sep 16 '22 15:09

thab