What thresholds should be set in Service Fabric Placement / Load balancing config for Cluster with large number of guest executable applications?

Tags:

azure-service-fabric

service-fabric-on-premises

I am having trouble with Service Fabric trying to place too many services onto a single node too quickly.

To give an idea of cluster size: there are 2-4 worker node types with 3-6 worker nodes per node type, each node type may run 200 guest executable applications, and each application will have at least 2 replicas. The nodes are more than capable of running the services at steady state; it is only during startup that CPU usage is too high.

The problem seems to be the thresholds or defaults for the placement and load balancing rules set in the cluster config. As examples of what I have tried: I have turned on InBuildThrottlingEnabled and set InBuildThrottlingGlobalMaxValue to 100, and I have set the Global Movement Throttle settings to various percentages of the total application count.

At this point there are two distinct scenarios I am trying to solve for. In both cases, the nodes go to 100% for an amount of time such that service fabric declares the node as down.

1st: Starting an entire cluster, with all nodes initially off, without overwhelming the nodes.

2nd: A single node being overwhelmed by too many services starting after a host comes back online.

Here are my current parameters on the cluster:

     {
       "Name": "PlacementAndLoadBalancing",
       "Parameters": [
         {
           "Name": "UseMoveCostReports",
           "Value": "true"
         },
         {
           "Name": "PLBRefreshGap",
           "Value": "1"
         },
         {
           "Name": "MinPlacementInterval",
           "Value": "30.0"
         },
         {
           "Name": "MinLoadBalancingInterval",
           "Value": "30.0"
         },
         {
           "Name": "MinConstraintCheckInterval",
           "Value": "30.0"
         },
         {
           "Name": "GlobalMovementThrottleThresholdForPlacement",
           "Value": "25"
         },
         {
           "Name": "GlobalMovementThrottleThresholdForBalancing",
           "Value": "25"
         },
         {
           "Name": "GlobalMovementThrottleThreshold",
           "Value": "25"
         },
         {
           "Name": "GlobalMovementThrottleCountingInterval",
           "Value": "450"
         },
         {
           "Name": "InBuildThrottlingEnabled",
           "Value": "false"
         },
         {
           "Name": "InBuildThrottlingGlobalMaxValue",
           "Value": "100"
         }
       ]
     },

Based on the discussion in the answer below, I wanted to add a graph: if a node goes down, the act of shuffling its services onto the remaining nodes causes a second node to go down. The green node goes down, then the purple node goes down because too many services are shuffled onto it.

[Image: a graph demonstrating the above. Green goes down, then purple behind it.]


asked Jun 24 '20 15:06

George Whiting

1 Answer

From SF's perspective, 1 and 2 are the same problem. Also, as a note, SF doesn't evict a node just because CPU consumption is high, so "The nodes go to 100% for an amount of time such that service fabric declares the node as down." needs some more explanation. The machines might be failing for other reasons, or I guess they could be so loaded that the kernel-level failure detectors can't ping other machines, but that isn't very common.

For config changes, I would remove all of these and go with the defaults:

 {
   "Name": "PLBRefreshGap",
   "Value": "1"
 },
 {
   "Name": "MinPlacementInterval",
   "Value": "30.0"
 },
 {
   "Name": "MinLoadBalancingInterval",
   "Value": "30.0"
 },
 {
   "Name": "MinConstraintCheckInterval",
   "Value": "30.0"
 },

For the in-build throttle to take effect, this setting needs to flip from its current value of false to true:

     {
       "Name": "InBuildThrottlingEnabled",
       "Value": "false"
     },

Also, since these movements are likely caused by constraint violations and placement (not proactive rebalancing), we need to explicitly instruct SF to throttle those operations as well. There is config for this in SF; although it is not documented or publicly supported at this time, you can see it in the settings. By default only balancing is throttled, but you should be able to turn on throttling for all phases and set appropriate limits via something like the below.

These first two settings are also within PlacementAndLoadBalancing, like the ones above.

 {
   "Name": "ThrottlePlacementPhase",
   "Value": "true"
 },
 {
   "Name": "ThrottleConstraintCheckPhase",
   "Value": "true"
 },
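As a rough sketch of the PlacementAndLoadBalancing changes suggested so far, the Python snippet below applies them to a parameter list shaped like the one in the question. The helper name `patch_plb_parameters` and the shortened `current` list are illustrative assumptions, not part of Service Fabric; only the parameter names and values come from the question and this answer.

```python
# Parameters the answer suggests removing so the defaults apply.
REMOVE = {
    "PLBRefreshGap",
    "MinPlacementInterval",
    "MinLoadBalancingInterval",
    "MinConstraintCheckInterval",
}

# Parameters the answer suggests setting (added if not already present).
SET = {
    "InBuildThrottlingEnabled": "true",
    "ThrottlePlacementPhase": "true",
    "ThrottleConstraintCheckPhase": "true",
}


def patch_plb_parameters(parameters):
    """Return a new parameter list with the suggested changes applied.

    Hypothetical helper: the input list is left untouched.
    """
    kept = [dict(p) for p in parameters if p["Name"] not in REMOVE]
    by_name = {p["Name"]: p for p in kept}
    for name, value in SET.items():
        if name in by_name:
            by_name[name]["Value"] = value
        else:
            kept.append({"Name": name, "Value": value})
    return kept


# Shortened copy of the question's parameter list, for illustration only.
current = [
    {"Name": "UseMoveCostReports", "Value": "true"},
    {"Name": "PLBRefreshGap", "Value": "1"},
    {"Name": "MinPlacementInterval", "Value": "30.0"},
    {"Name": "InBuildThrottlingEnabled", "Value": "false"},
    {"Name": "InBuildThrottlingGlobalMaxValue", "Value": "100"},
]

patched = patch_plb_parameters(current)
for p in patched:
    print(p["Name"], "=", p["Value"])
```

The snippet only rewrites an in-memory list; how the resulting section is put back into the cluster configuration and rolled out is outside its scope.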

The next settings, which set the limits, each live in their own section. Each section is a map from node type name to the limit you want to throttle at for that node type.

 {
   "name": "MaximumInBuildReplicasPerNodeConstraintCheckThrottle",
   "parameters": [
     {
       "name": "YourNodeTypeNameHere",
       "value": "100"
     },
     {
       "name": "YourOtherNodeTypeNameHere",
       "value": "100"
     }
   ]
 },
 {
   "name": "MaximumInBuildReplicasPerNodePlacementThrottle",
   "parameters": [
     {
       "name": "YourNodeTypeNameHere",
       "value": "100"
     },
     {
       "name": "YourOtherNodeTypeNameHere",
       "value": "100"
     }
   ]
 },
 {
   "name": "MaximumInBuildReplicasPerNodeBalancingThrottle",
   "parameters": [
     {
       "name": "YourNodeTypeNameHere",
       "value": "100"
     },
     {
       "name": "YourOtherNodeTypeNameHere",
       "value": "100"
     }
   ]
 },
 {
   "name": "MaximumInBuildReplicasPerNode",
   "parameters": [
     {
       "name": "YourNodeTypeNameHere",
       "value": "100"
     },
     {
       "name": "YourOtherNodeTypeNameHere",
       "value": "100"
     }
   ]
 }
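With 2-4 node types, writing the four sections above by hand gets repetitive. The following is a minimal Python sketch that generates them from a list of node type names. The function name `build_throttle_sections` is a hypothetical helper, and the node type names are the same placeholders used above; only the section names and the value of 100 come from this answer.

```python
import json

# Section names as listed in the answer above.
THROTTLE_SECTIONS = [
    "MaximumInBuildReplicasPerNodeConstraintCheckThrottle",
    "MaximumInBuildReplicasPerNodePlacementThrottle",
    "MaximumInBuildReplicasPerNodeBalancingThrottle",
    "MaximumInBuildReplicasPerNode",
]


def build_throttle_sections(node_type_names, limit):
    """Build one section per throttle, mapping every node type to `limit`."""
    return [
        {
            "name": section,
            "parameters": [
                {"name": node_type, "value": str(limit)}
                for node_type in node_type_names
            ],
        }
        for section in THROTTLE_SECTIONS
    ]


# Placeholder node type names; replace them with the real ones.
sections = build_throttle_sections(
    ["YourNodeTypeNameHere", "YourOtherNodeTypeNameHere"], 100
)
print(json.dumps(sections, indent=2))
```

Using one limit for every node type and every phase is a simplification; the generated sections can be edited afterwards if different node types need different limits.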

I would make these changes and then try again. Additional information like what is actually causing the nodes to be down (confirmed via events and SF health info) would help identify the source of the problem. It would probably also be good to verify that starting 100 instances of the apps on the node actually works and whether that's an appropriate threshold.
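To get a feel for whether 100 is a sensible per-node threshold, here is a back-of-the-envelope Python sketch using the figures from the question (200 applications per node type, 2 replicas each, 3-6 nodes per node type). It assumes replicas spread evenly across the nodes of a node type and that every application has exactly 2 replicas, neither of which is guaranteed in practice, so treat the numbers as rough estimates.

```python
import math

# Figures from the question; the even spread is an assumption.
APPS_PER_NODE_TYPE = 200
REPLICAS_PER_APP = 2   # the question says "at least 2", so this is a lower bound
THROTTLE_LIMIT = 100   # the value used in the throttle sections above


def replicas_per_node(node_count):
    """Replicas each node hosts if they spread evenly (rounded up)."""
    total = APPS_PER_NODE_TYPE * REPLICAS_PER_APP
    return math.ceil(total / node_count)


for nodes in range(3, 7):
    healthy = replicas_per_node(nodes)
    degraded = replicas_per_node(nodes - 1)  # one node of this type is down
    note = "above" if healthy > THROTTLE_LIMIT else "at or below"
    print(
        f"{nodes} nodes: about {healthy} replicas per node "
        f"({note} the limit of {THROTTLE_LIMIT}), "
        f"about {degraded} per node with one node down"
    )
```

If the estimated replica count per node is well above the limit, a restarted node cannot build all of its replicas at once under the throttle, which is the intended effect; whether the node can actually start that many guest executables concurrently still has to be verified on real hardware.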


answered Oct 14 '22 02:10

masnider
