I see strange behavior in a Redis cluster: it works totally fine under heavy load, but starts to run with a ~50% timeout rate and unstable response times under low load.
We see the same pattern each day during periods of low load.
Any ideas what could cause such a strange pattern? Maybe some maintenance work this Redis Cluster starts doing at low-load times, like slot rebalancing? Please recommend any settings or aspects to check.
Versions: Redis 2.0.7, Jedis 2.8.1
Configuration: 3 physical nodes with 9 master processes and 18 slaves.
JedisCluster Timeout = 5ms.
Load is 100% writes with setex.
These graphs are for JedisCluster response times, not actual Redis Cluster times. The "Sets" line here is actually successful sets, not the total count.
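For context, each write is a plain SETEX (set with a TTL); its redis-cli equivalent would look roughly like the line below, where the key, TTL, and value are placeholders rather than our real data:

redis08 ~ $ redis-cli -c -h 10.201.12.215 -p 9006 setex some:key 600 "some-value"
OK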
As demand on your clusters changes, you might decide to improve performance or reduce costs by changing the number of shards in your Redis (cluster mode enabled) cluster. We recommend using online horizontal scaling to do so, because it allows your cluster to continue serving requests during the scaling process.
Redis scales horizontally with a deployment topology called Redis Cluster.
Redis Cluster supports only one database (database 0), which is usually fine if you have a big dataset, whereas standalone Redis supports multiple databases. A Redis Cluster client must also support redirection (MOVED/ASK), while a client for standalone Redis doesn't need it.
Redis Cluster partitions the key space using hash slots rather than consistent hashing: the slot a particular key is assigned to is CRC16(key) mod 16384, and the 16384 slots are divided among the different masters in the cluster.
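Because the slot is just CRC16(key) mod 16384, you can check which slot (and therefore which master) any key lands on with CLUSTER KEYSLOT; the key name below is a placeholder, and the host/port are reused from the examples in this post:

redis08 ~ $ redis-cli -h 10.201.12.215 -p 9006 cluster keyslot some:key      # slot number, 0-16383
redis08 ~ $ redis-cli -h 10.201.12.215 -p 9006 cluster nodes | grep master   # which master owns which slot ranges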
Finally, I found that it looks like a network issue.
redis08(10.201.12.214) ~ $ redis-benchmark -h 10.201.12.215 -p 9006
====== PING_INLINE ======
100000 requests completed in 91.42 seconds
50 parallel clients
3 bytes payload
keep alive: 1
0.00% <= 11 milliseconds
redis09(10.201.12.215) ~ $ redis-benchmark -h 10.201.12.215 -p 9006
====== PING_INLINE ======
100000 requests completed in 1.41 seconds
50 parallel clients
3 bytes payload
keep alive: 1
99.46% <= 1 milliseconds
redis08 ~ $ ping lga-redis09
PING redis09 (10.201.12.215) 56(84) bytes of data.
64 bytes from redis09 (10.201.12.215): icmp_seq=1 ttl=64 time=10.7 ms
Looking at collectd's "if_octets" metric, we see enormous network activity on the network interfaces during this period of low write activity; the nighttime network load is roughly 10x the daytime load.
And it is caused by the Redis nodes themselves, which start to actively exchange data with each other during this low-load period. In iptraf's top-connections output this inter-node traffic dominates at night, while during the daytime the top connections in the same iptraf report belong entirely to actual Redis clients with a good write load.
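A quick way to confirm that this extra traffic is replication rather than client traffic is to look at the resync counters each master exposes (same host/port as in the benchmark above):

redis08 ~ $ redis-cli -h 10.201.12.215 -p 9006 info replication
redis08 ~ $ redis-cli -h 10.201.12.215 -p 9006 info stats | grep sync
# sync_full grows with every full resync; sync_partial_err counts partial resyncs
# that could not be served from the backlog and fell back to a full resync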
Finally, we found that we had issues with replication. Sometimes the replication backlog buffer was not big enough and slaves started a full resync. It looks like this night load was full resync attempts plus a low repl-timeout value, resulting in never-ending replication attempts. Why this replication affected the low-load nights so significantly and didn't affect the daytime, I don't know; I see no option that would make Redis retry more often at night or anything like that. If it's interesting, we fixed the never-ending replication by increasing the obvious settings (sketch after the list):
repl-backlog-size
repl-timeout
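A minimal sketch of the change, assuming it is applied to every master; the values below are illustrative, not the exact ones we ended up with:

redis08 ~ $ redis-cli -h 10.201.12.215 -p 9006 config set repl-backlog-size 268435456   # 256mb backlog so slaves can partially resync after short disconnects
redis08 ~ $ redis-cli -h 10.201.12.215 -p 9006 config set repl-timeout 120              # more time before a slow sync is considered failed
The same values go into redis.conf (repl-backlog-size 256mb, repl-timeout 120) so they survive a restart.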