Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Recovering from Redis connectivity loss

I am looking for the reference implementation of the recovery [in multi-threaded environment] after Redis connection has been lost. Was not able to find anything that makes sense so far.

Setup: I have an Azure worker role that runs the same code (ThreadProc) in multiple threads. Initially, I had static ConnectionMultiplexer and did .GetDatabase() before every Redis operation. That didn't pass stress testing at all (tons of "can't connect" errors once the load increased from low to moderate). I have changed it to this:

static readonly ConnectionMultiplexer _connection = ConnectionMultiplexer.Connect(...);
static readonly IDatabase _cache = _connection.GetDatabase();

void ThreadProc() // running in multiple threads
{
    while (true)
    {
      // using _cache here
    }
}

That worked pretty well even under high load (1000+ op/s per worker role instance) until I got "No connection is available to service this operation", and the thing can't recover since.

Please let me know what would be the correct/recommended code that can recover from intermittent connectivity problems.

like image 834
Konstantin Surkov Avatar asked Jul 23 '14 23:07

Konstantin Surkov


2 Answers

Edit 2015-05-02: Although the later versions of the StackExchange.Redis client are explicitly supposed to handle this sort of "lost connection" -> "reconnect" logic internally and automagically, my testing has shown beyond any real doubt that they don't manage to pull it off successfully, and hence this sort of thing is still needed, at least in busy environments. I yanked the code below out of my caching layer for a while, and ended up with tens of thousands of connection failure errors. I put it back in, and those all went away.

Edit 2015-02-24: This method should no longer be necessary. The latest versions of the StackExchange.Redis client correctly handle being disconnected - they'll reconnect themselves up automatically, and the workaround below just interferes with things. Keeping it here for historical purposes, but my recommendation is to ignore it.


Here's a few methods from my SimpleCacheRedis<T> wrapper that show how I'm handling the problem:

public async Task<TValue> GetAsync(string key, Func<Task<TValue>> missingFunc)
{
    key = GetKey(key);
    var value = default(TValue);
    try
    {
        var db = _connection.GetDatabase();
        var str = await db.StringGetAsync(key);
        if (!str.IsNullOrEmpty)
        {
            value = _jsonSerializer.Deserialize<TValue>(str);
        }
    }
    catch (RedisConnectionException ex)
    {
        HandleRedisConnectionError(ex);
    }
    catch (Exception ex)
    {
        _logger.Error("Error retrieving item '" + key +
                      "' from Redis cache; falling back to missingFunc(). Error = " + ex);
    }
    if (value == default(TValue))
    {
        present = false;
        value = await missingFunc();
        await PerformAddAsync(key, value);
    }
    return value;
}

private void HandleRedisConnectionError(RedisConnectionException ex)
{
    _logger.Error("Connection error with Redis cache; recreating connection for the next try, and falling back to missingFunc() for this one. Error = " + ex.Message);
    Task.Run(async () =>
    {
        try
        {
            await CreateConnectionAsync();
        }
        catch (Exception genEx)
        {
            _logger.Error("Unable to recreate redis connection (sigh); bailing for now: " + genEx.Message);
        }
    });
}

private async Task CreateConnectionAsync()
{
    if (_attemptingToConnect) return;
    var sw = new StringWriter();
    try
    {
        _attemptingToConnect = true;
        _connection = await ConnectionMultiplexer.ConnectAsync(_redisCs, sw);
    }
    catch (Exception ex)
    {
        _logger.Error("Unable to connect to redis async: " + ex);
        _logger.Debug("internal log: \r\n" + sw);
        throw;
    }
    finally
    {
        _attemptingToConnect = false;
    }
}

Basically, if I see that I can't connect to Redis because of a RedisConnectionException, I spin off a separate async task to recreate the shared connection. That call, of course, may fail, but calls were going to be failing during that period anyway. And as soon as it succeeds, any new calls will use that newly (re-)created connection. Like I said above, kinda boring.

My situation may be a little different from yours, in that I'm not using Redis as a permanent store, but just as a cache. That means the only impact from a lost redis connection is that I need to retrieve the result from the DB rather than from cache. So I can afford to treat certain things a little loosely.

like image 93
Ken Smith Avatar answered Nov 10 '22 14:11

Ken Smith


All right, I guess I'll answer my own question if nobody wants to, although it seems strange, being such a basic use case.

Here is the class managing connectivity loss:

static class RedisConnectionManager
{
    private static readonly Dictionary<string, IDatabase> _dictionary = new Dictionary<string, IDatabase>();

    internal static IDatabase GetDatabase(string connectionString)
    {
        lock (_dictionary)
        {
            if (!_dictionary.ContainsKey(connectionString))
                _dictionary.Add(connectionString, ConnectionMultiplexer.Connect(connectionString).GetDatabase());
            if (!_dictionary[connectionString].Multiplexer.IsConnected)
            {
                _dictionary[connectionString].Multiplexer.Dispose();
                _dictionary[connectionString] = ConnectionMultiplexer.Connect(connectionString).GetDatabase();
            }
            return _dictionary[connectionString];
        }
    }
}

This class handles multiple connection strings, so if you have just one, the code will be even simpler. Please note explicit Multiplexer.Dispose() call. Since underlying object owns physical TCP connection, you can't wait until GC kicks in to release the resources. By that time, depending on your load, you may have thousands of orphaned TCP connections hanging around.

This code works reasonably well, but I am still not 100% sure this is the best way to handle this. Please let me know if anyone has the idea how to improve this.

like image 38
Konstantin Surkov Avatar answered Nov 10 '22 14:11

Konstantin Surkov