
What are the limits on actor events in Service Fabric?

I am currently testing the scaling of my application and I ran into something I did not expect.

The application runs on a 5-node cluster, has multiple services and actor types, and uses the shared process model. One component uses actor events as a best-effort pub/sub system (fallbacks are in place, so a dropped notification is not a problem). The problem arises when the number of actors (that is, subscription topics) grows. The actor service currently has 100 partitions. At the point where things break, there are around 160,000 topics, each subscribed to 1-5 times (once per node where it is needed), with an average of 2.5 subscriptions per topic (roughly 400k subscriptions in total).

At that point communication in the cluster starts breaking down: new subscriptions are not created and unsubscribe calls time out. It also affects other services: internal calls to a diagnostics service (which query each of its 5 replicas) time out. This is probably due to the resolving of partition/replica endpoints, since outside calls to the web page are fine (these endpoints use the same technology and code stack).

The Event Viewer is full of warnings and errors like:

EventName: ReplicatorFaulted Category: Health EventInstanceId {c4b35124-4997-4de2-9e58-2359665f2fe7} PartitionId {a8b49c25-8a5f-442e-8284-9ebccc7be746} ReplicaId 132580461505725813 FaultType: Transient, Reason: Cancelling update epoch on secondary while waiting for dispatch queues to drain will result in an invalid state, ErrorCode: -2147017731
10.3.0.9:20034-10.3.0.13:62297 send failed at state Connected: 0x80072745
Error While Receiving Connect Reply : CannotConnect , Message : 4ba737e2-4733-4af9-82ab-73f2afd2793b:382722511 from Service 15a5fb45-3ed0-4aba-a54f-212587823cde-132580461224314284-8c2b070b-dbb7-4b78-9698-96e4f7fdcbfc

I've tried scaling the application without this subscribe model active, and I easily reach a workload twice as large without any issues.

So I have a few questions:

  • Are there limits known/advised for actor events?
  • Would increasing the partition count and/or node count help here?
  • Is the communication interference logical? Why are other service endpoints having issues as well?
P. Gramberg asked Feb 26 '21 08:02



1 Answer

After spending some time on a support ticket, we found some useful information. I'm posting my findings here in case they help someone.

Actor events use a resubscription model to make sure the subscriber is still connected to the actor. By default, this happens every 20 seconds. With this many subscriptions, that consumed a lot of resources, and eventually the whole system overloaded, with large numbers of idle threads waiting to resubscribe. You can reduce the load by setting resubscriptionInterval to a higher value when subscribing. The drawback is that the client may miss events in the meantime (for example, if a partition is moved).
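To get a feel for the numbers, here is a rough back-of-the-envelope sketch. It assumes every subscription issues exactly one resubscribe call per interval, which is a simplification; the helper name ResubscriptionLoad is made up for illustration and is not part of any SDK. The 400k subscriptions and the 20-second default are the figures from this post.

```csharp
using System;

// Hypothetical helper for illustration only; not part of the Service Fabric SDK.
public static class ResubscriptionLoad
{
    // Approximate cluster-wide resubscribe calls per second, assuming each
    // subscription resubscribes exactly once per interval (a simplification).
    public static double CallsPerSecond(long subscriptions, TimeSpan resubscriptionInterval)
    {
        if (resubscriptionInterval <= TimeSpan.Zero)
            throw new ArgumentOutOfRangeException(nameof(resubscriptionInterval));

        return subscriptions / resubscriptionInterval.TotalSeconds;
    }
}
```

Under that assumption, ResubscriptionLoad.CallsPerSecond(400_000, TimeSpan.FromSeconds(20)) gives 20,000 calls per second across the cluster, while TimeSpan.FromHours(2) brings it down to roughly 56 per second.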

To counteract the delay in resubscribing, it is possible to hook into the lower-level Service Fabric notifications. The following pseudocode was offered to me in the support call.

  1. Register for endpoint change notifications for the actor service
       // Assumption: resubscriptions is a ConcurrentDictionary<Guid, List<long>>
       // shared with step 2 (partition id -> ids of the subscribed actors).
       fabricClient.ServiceManager.ServiceNotificationFilterMatched += (o, e) =>
       {
           var notification = ((FabricClient.ServiceManagementClient.ServiceNotificationEventArgs)e).Notification;
           /*
            * Add additional logic for optimizations:
            * - Check that the endpoint is not empty.
            * - If multiple listeners are registered, check whether the endpoint change notification is for the desired endpoint.
            * Please note that all the endpoints are sent in the notification. User code should cache the endpoint
            * seen during the subscription call and compare it with the newer one.
            */
           List<long> keys;
           if (resubscriptions.TryGetValue(notification.PartitionId, out keys))
           {
               foreach (var key in keys)
               {
                   // 1. Unsubscribe the previous subscription by calling ActorProxy.UnsubscribeAsync()
                   // 2. Resubscribe by calling ActorProxy.SubscribeAsync()
               }
           }
       };

       // Arguments: service name, match name prefix = true, match primary changes only = true
       await fabricClient.ServiceManager.RegisterServiceNotificationFilterAsync(new ServiceNotificationFilterDescription(new Uri("<service name>"), true, true));
  2. Change the resubscription interval to a value that fits your needs. Cache the partition id to actor id mapping. This cache will be used to resubscribe when the replica's primary endpoint changes (see step 1)
       await actor.SubscribeAsync(handler, TimeSpan.FromHours(2) /* Tune the value according to your needs */);
       ResolvedServicePartition rsp;
       ((ActorProxy)actor).ActorServicePartitionClientV2.TryGetLastResolvedServicePartition(out rsp);
       var keys = resubscriptions.GetOrAdd(rsp.Info.Id, key => new List<long>());
       keys.Add(communicationId);

The above approach ensures the following:

  • The subscriptions are resubscribed at regular intervals
  • If the primary endpoint changes in between, the actor proxy is resubscribed from the service notification callback

This ends the pseudocode from the support call.
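One thing the pseudocode leaves open is the cache itself: List<long> is not safe for concurrent writes, and the comment about comparing the cached endpoint with the new one is not implemented. Below is a minimal sketch of how such a cache might look. It is my own illustration, not something from the support call: the class name ResubscriptionCache and its methods are hypothetical, and the endpoint is treated as a plain string (extracting it from the notification is left out).

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration only; not part of the Service Fabric SDK.
public sealed class ResubscriptionCache
{
    // Partition id -> set of actor ids subscribed through that partition.
    // A ConcurrentDictionary with dummy values is used as a thread-safe set.
    private readonly ConcurrentDictionary<Guid, ConcurrentDictionary<long, byte>> actorsByPartition =
        new ConcurrentDictionary<Guid, ConcurrentDictionary<long, byte>>();

    // Partition id -> endpoint seen at subscription time (or at the last change).
    private readonly ConcurrentDictionary<Guid, string> lastEndpoint =
        new ConcurrentDictionary<Guid, string>();

    // Call after a successful SubscribeAsync.
    public void Track(Guid partitionId, long actorId, string endpoint)
    {
        actorsByPartition
            .GetOrAdd(partitionId, _ => new ConcurrentDictionary<long, byte>())[actorId] = 0;
        lastEndpoint[partitionId] = endpoint;
    }

    // Call after UnsubscribeAsync so the actor is no longer resubscribed.
    public void Untrack(Guid partitionId, long actorId)
    {
        if (actorsByPartition.TryGetValue(partitionId, out var actors))
            actors.TryRemove(actorId, out _);
    }

    // Returns the actor ids to resubscribe when the endpoint differs from the
    // cached one; returns an empty collection for empty endpoints, unknown
    // partitions, or unchanged endpoints.
    public IReadOnlyCollection<long> ActorsToResubscribe(Guid partitionId, string newEndpoint)
    {
        if (string.IsNullOrEmpty(newEndpoint))
            return Array.Empty<long>();
        if (!actorsByPartition.TryGetValue(partitionId, out var actors))
            return Array.Empty<long>();
        if (lastEndpoint.TryGetValue(partitionId, out var known) && known == newEndpoint)
            return Array.Empty<long>();

        lastEndpoint[partitionId] = newEndpoint;
        return actors.Keys.ToArray();
    }
}
```

In the notification callback you would call ActorsToResubscribe(notification.PartitionId, endpoint) and run the unsubscribe/subscribe pair for each returned id. The two dictionaries are not updated atomically, so a notification racing with a new subscription can cause an extra resubscribe; for a best-effort setup like mine that seemed acceptable.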

Answering my original questions:

  • Are there limits known/advised for actor events? No hard limits, only resource usage.
  • Would increasing the partition count and/or node count help here? Partition count: no. Node count: maybe, but only if it results in fewer subscribing entities per node.
  • Is the communication interference logical? Why are other service endpoints having issues as well? Yes, resource contention is the reason.
P. Gramberg answered Nov 13 '22 21:11
