High-throughput send to EventHubs resulting in MessagingException / TimeoutException / "Server was unable to process the request" errors

Tags: azure, azureservicebus, azure-eventhub

We are experiencing lots of these exceptions sending events to EventHubs during peak traffic:

"Failed to send event to EventHub. Exception : Microsoft.ServiceBus.Messaging.MessagingException: The server was unable to process the request; please retry the operation. If the problem persists, please contact your Service Bus administrator and provide the tracking id." or "Failed to send event to EventHub. Exception : System.TimeoutException: The operation did not complete within the allocated time "

You can see it clearly here:

[Image: Azure Portal EH (https://i.stack.imgur.com/ZSbzL.png)]

As you can see, we get lots of Internal Errors, Server Busy Errors, and Failed Requests when incoming messages exceed 400K events/hour (or ~270 MB/hour). This is not just a transient issue. It's clearly related to throughput.

Our EH has 32 partitions, message retention of 7 days, and 5 throughput units assigned. OperationTimeout is set to 5 mins, and we are using the default RetryPolicy.

Is there anything we still need to tweak here? We are really concerned about the scalability of EH.

Thanks

593

asked Nov 11 '15 23:11

Jose Parra

2 Answers

Send throughput tuning comes down to using an efficient partition distribution strategy. There isn't any single knob that does this. Below is the basic information you will need to design for high-throughput scenarios.

1) Let's start with the namespace: Throughput Units (aka TUs) are configured at the namespace level. Please bear in mind that the configured TUs are an aggregate shared by all EventHubs under that namespace. If you have 5 TUs on your namespace and 5 EventHubs under it, those 5 TUs are divided among all 5 EventHubs.

2) Now let's look at the EventHub level: if the EventHub is allocated 5 TUs and has 32 partitions, no single partition can use all 5 TUs. For example, sending 5 TUs' worth of data to one partition and zero to the other 31 partitions is not possible. The maximum you should plan for per partition is 1 TU. In general, you need to ensure that the data is distributed evenly across all partitions. EventHubs supports 3 types of send, which give users different levels of control over partition distribution:

- EventHubClient.Send(EventDataWithoutPartitionKey) -> if you use this API to send, EventHubs takes care of distributing the data evenly across all partitions. The EventHubs service gateway round-robins the data to all partitions. When a specific partition is down, the gateways detect it automatically and ensure clients don't see any impact. This is the most recommended way to send to EventHubs.
- EventHubClient.Send(EventDataWithPartitionKey) -> if you use this API to send, the PartitionKey determines the distribution of your data. The PartitionKey is used to hash the EventData to the appropriate partition (the hash algorithm is Microsoft proprietary and not shared). Typically, users who need to correlate a group of messages use this variant of Send.
- EventHubSender.Send(EventData) -> in this variant, the sender is already attached to a partition, so this gives the client complete control over distribution across partitions.
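To get a feel for how much the choice of send variant can matter, here is a minimal Python sketch that compares an even round-robin spread with a key-based spread. It is only a simulation: the real hash is not public, so `by_key` uses MD5 as a stand-in, and the `tenant-N` keys, the event count, and the `skew` metric are made up for illustration.

```python
import hashlib
from collections import Counter
from itertools import cycle

PARTITIONS = 32  # partition count from the question


def round_robin(n_events, partitions=PARTITIONS):
    """Spread events one partition at a time, round-robin style."""
    counts = Counter()
    targets = cycle(range(partitions))
    for _ in range(n_events):
        counts[next(targets)] += 1
    return counts


def by_key(keys, partitions=PARTITIONS):
    """Map each key to a partition with a stand-in hash (NOT the real EventHubs hash)."""
    counts = Counter()
    for key in keys:
        digest = hashlib.md5(key.encode("utf-8")).digest()
        counts[int.from_bytes(digest[:4], "big") % partitions] += 1
    return counts


def skew(counts, partitions=PARTITIONS):
    """Busiest partition's load divided by the ideal even share (1.0 means perfectly even)."""
    total = sum(counts.values())
    return max(counts.values()) / (total / partitions)


even = round_robin(32_000)
# Hypothetical workload: only three distinct keys, so at most three partitions get data.
hot = by_key(["tenant-%d" % (i % 3) for i in range(32_000)])
print("round-robin skew:", round(skew(even), 2))
print("keyed skew:", round(skew(hot), 2))
```

With only a handful of distinct keys, the busiest partition ends up carrying many times its even share, which is the kind of imbalance the per-partition limit punishes.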

To measure your current data distribution, use the EventHubClient.GetPartitionRuntimeInfo API to estimate which partition is overloaded. The difference between BeginSequenceNumber and LastEnqueuedSequenceNumber should give an estimate of that partition's load compared to the others.
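The sequence-number comparison could be sketched roughly like this in Python. The function names, the `factor` threshold, and the sample numbers are all hypothetical; the input is assumed to be values you have already collected per partition from the runtime info call.

```python
def partition_shares(runtime_info):
    """Return each partition's share of the retained events.

    runtime_info maps partition id -> (BeginSequenceNumber, LastEnqueuedSequenceNumber).
    """
    retained = {pid: max(last - begin, 0) for pid, (begin, last) in runtime_info.items()}
    total = sum(retained.values())
    if total == 0:
        return {pid: 0.0 for pid in retained}
    return {pid: count / total for pid, count in retained.items()}


def overloaded(runtime_info, factor=2.0):
    """List partitions holding more than `factor` times the even share."""
    shares = partition_shares(runtime_info)
    if not shares:
        return []
    even_share = 1.0 / len(shares)
    return sorted(pid for pid, share in shares.items() if share > factor * even_share)


# Made-up sample values for four partitions, for illustration only.
sample = {"0": (0, 1_000), "1": (0, 1_200), "2": (0, 9_000), "3": (0, 800)}
print(overloaded(sample))  # prints ['2']
```

Since the difference only counts events still within retention, it is a relative indicator rather than a rate, so compare partitions against each other rather than against the TU limits.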

3) Last but not least, you can tune performance (not throughput) at the send-operation level using the SendBatch API. 1 TU buys a maximum of 1000 msgs/sec or 1 MB/sec; you will be throttled by whichever limit you hit first, and this cannot be changed. If your messages are small, say 100 bytes, you will hit the 1000 events/sec limit first. However, with the SendBatch API you can batch, say, 10 of those 100-byte messages and push at the same rate of 1000 msgs/sec with just 100 API calls per second, which improves the end-to-end latency of the system (batching also helps the service persist messages efficiently). Remember, the only limitation here is the maximum message size that can be sent, which is 256 KB (this limit applies to your batch size if you use the SendBatch API).
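As a rough illustration of the batching arithmetic, the Python sketch below groups payloads into batches capped by event count and total size. `make_batches` and its parameters are hypothetical helpers, not part of any SDK, and the size check counts raw payload bytes only, so a real batch would need some headroom for per-event overhead.

```python
MAX_BATCH_BYTES = 256 * 1024  # size cap quoted in the answer


def make_batches(payloads, max_events=10, max_bytes=MAX_BATCH_BYTES):
    """Group byte payloads into batches limited by event count and total size."""
    batches, current, current_bytes = [], [], 0
    for payload in payloads:
        if len(payload) > max_bytes:
            raise ValueError("single payload exceeds the batch size cap")
        # Close the current batch when adding this payload would break either limit.
        if current and (len(current) >= max_events or current_bytes + len(payload) > max_bytes):
            batches.append(current)
            current, current_bytes = [], 0
        current.append(payload)
        current_bytes += len(payload)
    if current:
        batches.append(current)
    return batches


# One second of 100-byte events at the 1 TU message rate.
events = [b"x" * 100 for _ in range(1000)]
batches = make_batches(events)
print(len(batches))  # prints 100: 100 send calls instead of 1000
```

Each inner list would then be handed to a single batch send, so the event rate stays the same while the number of calls drops by the batch size.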

Given that background, in your case: with 32 partitions and 5 TUs, I would really double-check the partition distribution strategy.
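A quick back-of-the-envelope check in Python, using the hourly figures from the question and the per-TU limits above, shows why distribution is the first thing to look at. The `utilization` helper is hypothetical, and hourly averages hide short bursts, so treat the result as a rough indication only.

```python
EVENTS_PER_SEC_PER_TU = 1000  # per-TU message limit quoted above
MB_PER_SEC_PER_TU = 1.0       # per-TU ingress limit quoted above


def utilization(events_per_hour, mb_per_hour, throughput_units):
    """Compare an hourly average load with the aggregate TU limits."""
    events_per_sec = events_per_hour / 3600
    mb_per_sec = mb_per_hour / 3600
    return {
        "events_per_sec": events_per_sec,
        "mb_per_sec": mb_per_sec,
        "event_utilization": events_per_sec / (EVENTS_PER_SEC_PER_TU * throughput_units),
        "byte_utilization": mb_per_sec / (MB_PER_SEC_PER_TU * throughput_units),
    }


# Figures from the question: 400K events/hour, ~270 MB/hour, 5 TUs.
report = utilization(400_000, 270, 5)
for name, value in report.items():
    print(f"{name}: {value:.3f}")
```

On average that is roughly 111 events/sec and 0.075 MB/sec, only a few percent of what 5 TUs allow in aggregate, which suggests the errors come from how the load is spread (or from bursts) rather than from the namespace-level quota.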

here's some more general reading on Event Hubs...

195

answered Oct 13 '22 00:10

Sreeram Garlapati

After a lot of digging we decided to stop setting the PK (partition key) for posted messages, and the issue simply went away! We were using a GUID as the PK. We now get very few errors on the Azure Portal, and no more exceptions. Hope this helps someone else.

answered Oct 13 '22 00:10

Jose Parra
