
How to stop an idle Service Fabric Cluster Upgrade?

I have a Service Fabric cluster that seems to have been stuck in the rollback phase of an automatic upgrade for over seven days.

This is the output from Get-ServiceFabricClusterUpgrade:

TargetCodeVersion             : 5.5.216.0
TargetConfigVersion           : 2
StartTimestampUtc             : 15/06/2017 23:44:40
FailureTimestampUtc           : 16/06/2017 01:41:48
FailureReason                 : HealthCheck
UpgradeState                  : RollingBackInProgress
UpgradeDuration               : 7.14:13:10
CurrentUpgradeDomainDuration  : 7.12:16:03
CurrentUpgradeDomainProgress  : 0

                                NodeName            : xxxxxxxxxxxxxxxxxxxxx
                                UpgradePhase        : PreUpgradeSafetyCheck
                                PendingSafetyChecks :
                                WaitForInbuildReplica - PartitionId: xxxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxx
NextUpgradeDomain             : 1
UpgradeDomainsStatus          : { "0" = "InProgress";
                                  "1" = "Pending";
                                  "2" = "Pending";
                                  "3" = "Pending";
                                  "4" = "Pending" }

The only other cmdlets in the Service Fabric PowerShell module that seem related are Start-ServiceFabricClusterUpgrade, Resume-ServiceFabricClusterUpgrade and Update-ServiceFabricClusterUpgrade.

I have tried Start-ServiceFabricClusterUpgrade with the -Force switch, hoping it would cancel the hanging upgrade and start a new one, but unfortunately it did not. I have also restarted the node that is in progress, but that made no difference either.

In the absence of a Stop-ServiceFabricClusterUpgrade, is there anything else I can do to stop this process?

Declan McNulty asked Jun 23 '17 14:06


3 Answers

The "Troubleshoot application upgrades" documentation says:

"An UpgradePhase of PreUpgradeSafetyCheck means there were issues preparing the upgrade domain before it was performed. The most common issues in this case are service errors in the close or demotion from primary code paths."

So Service Fabric was probably unable to shut down the service executable. The easiest fix might be to deactivate (restart) the node mentioned in the output, from Service Fabric Explorer.
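The same deactivation can also be done from PowerShell rather than Service Fabric Explorer. Below is a rough sketch, not a verified fix for this cluster: the connection endpoint, node name and partition ID are placeholders, and a secured cluster needs additional certificate parameters on Connect-ServiceFabricCluster. Note that deactivation runs its own safety checks, so it may also wait on the same partition.

```powershell
# Placeholder endpoint; secured clusters need extra certificate parameters here.
Connect-ServiceFabricCluster -ConnectionEndpoint "mycluster.example.com:19000"

# Placeholders: take both values from the Get-ServiceFabricClusterUpgrade output.
$nodeName    = "<node name>"
$partitionId = "<partition id from PendingSafetyChecks>"

# Look at the replicas of the partition that the safety check is waiting on.
Get-ServiceFabricReplica -PartitionId $partitionId

# Deactivate the node with Restart intent (-Force skips the confirmation prompt).
Disable-ServiceFabricNode -NodeName $nodeName -Intent Restart -Force

# Check how the deactivation is progressing.
Get-ServiceFabricNode -NodeName $nodeName |
    Select-Object NodeName, NodeStatus, NodeDeactivationInfo

# Bring the node back once it has been deactivated.
Enable-ServiceFabricNode -NodeName $nodeName
```

Afterwards, running Get-ServiceFabricClusterUpgrade again should show whether the pending safety check has cleared.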

Kiryl answered Oct 19 '22 11:10


What I did in the end was log onto the nodes in the cluster one by one and restart them, waiting for the previous one to come back up before restarting the next one.

This fixed it, and the upgrade process eventually finished. Restarting the VMSS would probably have achieved the same thing, but I'm not sure whether there would have been a service outage during the restart. It certainly would have been less time-consuming.
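For anyone who wants to script a one-at-a-time restart, here is a minimal sketch. It is not exactly what I did: it uses Restart-ServiceFabricNode, which restarts the Service Fabric node process rather than rebooting the machine, so it may not have the same effect. Wait-NodeUp is a hypothetical helper written for this example, the timeout values are arbitrary, and an existing Connect-ServiceFabricCluster session is assumed.

```powershell
# Hypothetical helper: poll until the named node reports NodeStatus 'Up'.
function Wait-NodeUp {
    param(
        [Parameter(Mandatory)] [string] $NodeName,
        [int] $TimeoutSeconds = 900,   # arbitrary default
        [int] $PollSeconds = 15        # arbitrary default
    )
    $deadline = (Get-Date).AddSeconds($TimeoutSeconds)
    while ((Get-Date) -lt $deadline) {
        $node = Get-ServiceFabricNode -NodeName $NodeName
        if ($node.NodeStatus -eq 'Up') { return }
        Start-Sleep -Seconds $PollSeconds
    }
    throw "Node $NodeName did not return to Up within $TimeoutSeconds seconds."
}

# Restart nodes one by one, waiting for each to come back before the next.
foreach ($node in Get-ServiceFabricNode) {
    # Verify mode asks the cmdlet to wait for the restart to complete.
    Restart-ServiceFabricNode -NodeName $node.NodeName -CommandCompletionMode Verify
    Wait-NodeUp -NodeName $node.NodeName
}

# See whether the rollback has moved on.
Get-ServiceFabricClusterUpgrade
```

Waiting for each node before touching the next one is the point of the loop: it keeps only one node down at a time, as in the manual procedure.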

Declan McNulty answered Oct 19 '22 11:10


I can see two ways to accomplish this:

  • Delete the Service Fabric cluster and recreate it
  • or, preferably, restart the Virtual Machine Scale Set (effectively the same as restarting the servers). I'm sure there's a way to do this through PowerShell instead of the Azure portal.
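As a sketch of the PowerShell route, assuming the Az module is installed and you are signed in with Connect-AzAccount (the older AzureRM module has the equivalent Restart-AzureRmVmss). The resource group and scale set names are placeholders.

```powershell
# Placeholders: substitute the values for your cluster's scale set.
$resourceGroup = "<resource group>"
$scaleSetName  = "<scale set name>"

# List the instances in the scale set.
Get-AzVmssVM -ResourceGroupName $resourceGroup -VMScaleSetName $scaleSetName |
    Select-Object InstanceId, Name

# Restart a single instance (instance ID "0" is only an example) ...
Restart-AzVmss -ResourceGroupName $resourceGroup -VMScaleSetName $scaleSetName -InstanceId "0"

# ... or restart every instance in the scale set.
Restart-AzVmss -ResourceGroupName $resourceGroup -VMScaleSetName $scaleSetName
```

Restarting one instance at a time is the more cautious option, since restarting the whole scale set at once may take the services offline while the nodes come back.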


The Muffin Man answered Oct 19 '22 10:10