During a load test of the application (whose services report load dynamically), the whole application stopped working after one replica of a stateful partition raised a warning:
Warning System.RAP IStatefulServiceReplica.ChangeRole(S)Duration Thu, 21 Jul 2016 3:36:03 GMT Infinity 131135817636324745 false false Start Time (UTC): 2016-07-21 13:35:43.632
This happens after a load balancing of the replica, and it happened to a 4th replica of the partition even though we only target 3. So even if Service Fabric just killed that replica, the application should be fine (the primary and two other secondaries are up). However, the whole thing jams: from the logging I can see that at least 10k events still need to be processed, but processing stops completely.
In the images above you can see the details of the particular replica. The only differences between this replica and the other secondary replicas are in the following values:
I also find it odd that the Replica Status says Ready and no longer Reconfiguring, while the read/write status says it is still reconfiguring. I'm running the newest SDK (2.1.163, released 18-07-2016). I thought the bug fix was in there, but even though the problem became much harder to reproduce, it still occurs. Does anyone know what might be causing this or how to fix it?
edit: Screenshot of the failing partition
After Vaclav's response I started logging everything in RunAsync to determine what was actually causing the problem, i.e. which part of the code did not exit when cancellation was requested. As Vaclav pointed out, the method did not stop when cancellation was requested. However, the code section in which it gets stuck appears to be native Service Fabric code:
using (ITransaction tx = StateManager.CreateTransaction())
{
    await queue.TryDequeueAsync(tx, _queueTimeout, cancellationToken);
    await tx.CommitAsync();
}
The queue is a ReliableQueue, the time-out is set to the default 4 seconds, and the cancellation token comes from RunAsync. After adding logging between each line we got the following logging pattern:
//pre transaction
using (ITransaction tx = StateManager.CreateTransaction())
{
    //pre dequeue
    await queue.TryDequeueAsync(tx, _queueTimeout, cancellationToken);
    //dequeued
    await tx.CommitAsync();
    //committed
}
//post transaction
At each line I also logged the value of the cancellation request, and a background task logged when the cancellation request was fired. As a result we got, for example, this:
pre transaction: False
predequeue: False
dequeued: False
CancelationTokenFired: True
The precise location could vary, but the last log before CancelationTokenFired was always one of the lines inside the transaction; the matching follow-up log never appeared, i.e. the hang occurred inside a Service Fabric call (TryDequeueAsync or CommitAsync).
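For reference, the "background task" that logged the cancellation can be expressed as a callback registered on the token. A minimal sketch, assuming the logger from the default Service Fabric service template (ServiceEventSource.Current):

// Sketch: log the moment cancellation is requested, independently of
// whatever RunAsync is currently awaiting.
cancellationToken.Register(
    () => ServiceEventSource.Current.Message("CancelationTokenFired: True"));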
As stated before, this is on the most recent SDK (released 18-07-2016), which supposedly contained a bug fix for a similar problem. The problem also occurred on the older SDK, and even more frequently back then. But even on the new version it is still reproducible on each run.
This warning means your service isn't exiting RunAsync when a primary replica of your service is changing role during reconfiguration (look at the health warning in your last screenshot). Make sure you honor that cancellation token in every possible code path. This also applies to communication listeners - make sure they are responding to CloseAsync().
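As a rough illustration of that advice, here is a minimal sketch of a cooperative RunAsync loop (the service class, queue name, and processing step are placeholders, not the asker's actual code):

using System;
using System.Fabric;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.ServiceFabric.Data;
using Microsoft.ServiceFabric.Data.Collections;
using Microsoft.ServiceFabric.Services.Runtime;

internal sealed class MyQueueService : StatefulService
{
    public MyQueueService(StatefulServiceContext context) : base(context) { }

    protected override async Task RunAsync(CancellationToken cancellationToken)
    {
        var queue = await StateManager.GetOrAddAsync<IReliableQueue<string>>("myQueue");

        while (true)
        {
            // Bail out promptly when Service Fabric requests a role change.
            cancellationToken.ThrowIfCancellationRequested();

            using (ITransaction tx = StateManager.CreateTransaction())
            {
                // Pass the same token into every Reliable Collections call.
                var result = await queue.TryDequeueAsync(tx, TimeSpan.FromSeconds(4), cancellationToken);
                if (result.HasValue)
                {
                    // ... process result.Value here ...
                }
                await tx.CommitAsync();
            }
        }
    }
}

The key points are checking the token at the top of every iteration and passing it into every awaited call, so the loop can unwind quickly when reconfiguration starts.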
Given what you're saying, here's what most likely happened: a reconfiguration started (which is why you saw the extra 4th replica), and it is stuck waiting for your service to exit RunAsync. Once reconfiguration completes, your replica set size will be reduced back to your target of 3.
We won't kill your slow replica, because we don't know that your application will be fine - maybe it's taking a long time to safely process valuable data. Service Fabric is very paranoid about safety and won't do anything that could possibly cause your service to lose data.
Service Fabric Explorer unfortunately doesn't show the reconfiguring state; it shows you the expected end result. But if you run Get-ServiceFabricPartition in PowerShell, it will show you the partition's reconfiguring state.
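For example (standard Service Fabric PowerShell cmdlets; the service name is a placeholder):

# Connect to the cluster, then inspect the partition's status
Connect-ServiceFabricCluster
Get-ServiceFabricPartition -ServiceName fabric:/MyApp/MyStatefulService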