Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to tell from Npgsql exception if the call is worth a retry (transient fault strategy)

I'm writing a service which will be connecting to a remote postgres server. I'm looking for a good way to determine which exceptions should be treated as transient (worth retrying), and how to define an appropriate policy for connecting to a remote database.

The service is using Npgsql for the data access. The documentation says that Npgsql will throw a PostgresException for sql errors and an NpgsqlException for "server related issues".

So far the best I have been able to come up with is to assume all exceptions that are not PostgresExceptions should be treated as possibly transient, worth retrying, but a PostgresException would mean that there is something wrong with the query and that retrying would not help. Am I correct in this assumption?

I am using Polly to create a Retry and Circuit Breaker policy. Thus, my policy looks like this:

Policy.Handle<Exception>( AllButPotgresExceptions()) // if its a postgres exception we know its not going to work even with a retry, so don't
                       .WaitAndRetryAsync(new[]
                       {
                           TimeSpan.FromSeconds(1),
                           TimeSpan.FromSeconds(2),
                           TimeSpan.FromSeconds(4)
                       }, onRetry: (exception, span) => Log.Warning(exception, "Postgres Retry Failure: "))
                    .WrapAsync(
                           Policy.Handle<Exception>( AllButPotgresExceptions())
                               .AdvancedCircuitBreakerAsync(
                                   failureThreshold:.7, 
                                   samplingDuration: TimeSpan.FromSeconds(30), 
                                   minimumThroughput: 20, 
                                   durationOfBreak: TimeSpan.FromSeconds(30), 
                                   onBreak: (ex, timeSpan, context) => Log.Warning(ex, "Postres Circuit Breaker Broken: "), 
                                   onReset: (context) => Log.Warning("Postres Circuit Breaker Reset: "), 
                                   onHalfOpen: () => Log.Warning("Postres Circuit Breaker Half Open: ")
                               )));
        }
    }

    private static Func<Exception, bool> AllButPotgresExceptions()
    {
        return ex => ex.GetType() != typeof(PostgresException);
    }

Is there a better way to determine which errors may be transient?

UPDATE:

Following Shay's suggestions I opened a new issue in Npgsql and updated my Policy to look like this:

public static Policy PostresTransientFaultPolicy
    {
        get
        {
            return postgresTransientPolicy ?? (postgresTransientPolicy = Policy.Handle<Exception>( PostgresDatabaseTransientErrorDetectionStrategy())
                       .WaitAndRetryAsync(
                            retryCount: 10, 
                            sleepDurationProvider: retryAttempt => ExponentialBackoff(retryAttempt, 1.4), 
                            onRetry: (exception, span) => Log.Warning(exception, "Postgres Retry Failure: "))
                    .WrapAsync(
                           Policy.Handle<Exception>( PostgresDatabaseTransientErrorDetectionStrategy())
                               .AdvancedCircuitBreakerAsync(
                                   failureThreshold:.4, 
                                   samplingDuration: TimeSpan.FromSeconds(30), 
                                   minimumThroughput: 20, 
                                   durationOfBreak: TimeSpan.FromSeconds(30), 
                                   onBreak: (ex, timeSpan, context) => Log.Warning(ex, "Postres Circuit Breaker Broken: "), 
                                   onReset: (context) => Log.Warning("Postres Circuit Breaker Reset: "), 
                                   onHalfOpen: () => Log.Warning("Postres Circuit Breaker Half Open: ")
                               )));
        }
    }

    private static TimeSpan ExponentialBackoff(int retryAttempt, double exponent)
    {
        //TODO add random %20 variance on the exponent
        return TimeSpan.FromSeconds(Math.Pow(retryAttempt, exponent));
    }

    private static Func<Exception, bool> PostgresDatabaseTransientErrorDetectionStrategy()
    {
        return (ex) =>
        {                
            //if it is not a postgres exception we must assume it will be transient
            if (ex.GetType() != typeof(PostgresException))
                return true;

            var pgex = ex as PostgresException;
            switch (pgex.SqlState)
            {
                case "53000":   //insufficient_resources
                case "53100":   //disk_full
                case "53200":   //out_of_memory
                case "53300":   //too_many_connections
                case "53400":   //configuration_limit_exceeded
                case "57P03":   //cannot_connect_now
                case "58000":   //system_error
                case "58030":   //io_error

                //These next few I am not sure whether they should be treated as transient or not, but I am guessing so

                case "55P03":   //lock_not_available
                case "55006":   //object_in_use
                case "55000":   //object_not_in_prerequisite_state
                case "08000":   //connection_exception
                case "08003":   //connection_does_not_exist
                case "08006":   //connection_failure
                case "08001":   //sqlclient_unable_to_establish_sqlconnection
                case "08004":   //sqlserver_rejected_establishment_of_sqlconnection
                case "08007":   //transaction_resolution_unknown
                    return true;
            }

            return false;
        };
    }
like image 545
Dominick O'Dierno Avatar asked Mar 15 '17 23:03

Dominick O'Dierno


People also ask

What does a transient error mean?

A transient error occurs during a connection try After a delay of several seconds, retry the connection.

What is exception handling in PostgreSQL?

PostgreSQL raises an exception is used to raise the statement for reporting the warnings, errors and other type of reported message within a function or stored procedure.


1 Answers

Your approach is good. NpgsqlException usually means a network/IO error, although you can examine the inner exception and check for IOException to be sure.

PostgresException is thrown when PostgreSQL reports an error, which in most cases is a problem with the query. However, there may be some transient server-side issues (e.g. too many connections), you can examine the SQL error code for that - see the PG docs.

It may be a good idea to add an IsTransient property to these exceptions, encoding these checks inside PostgreSQL itself - you're welcome to open an issue for that on the Npgsql repo.

like image 76
Shay Rojansky Avatar answered Sep 20 '22 12:09

Shay Rojansky