I am trying to implement an Event Hub in Azure. I have managed to create a Producer which publishes messages to the Event Hub, as well as a Consumer which reads them off. My Event Hub is divided up into 16 partitions. On the consumer side, I loop through each of these as follows:
var eventHub = NamespaceManager.CreateFromConnectionString(builder.ToString()).GetEventHub("de-analytics-events");
foreach (var partitionId in eventHub.PartitionIds)
{
    subscriberGroup.RegisterProcessor<EventProcessor>(new Lease
    {
        PartitionId = partitionId
    }, new EventProcessorCheckpointManager());
    Console.WriteLine("Processing: " + partitionId);
}
Looking at these values in a debugger shows that the eventHub.PartitionIds range from "0" to "15" in the case of 16 partitions.
However, on the producer side, all I was allowed to specify was my EventData.PartitionKey, which is a string, but which does not directly correspond to the strings on the consumer side. E.g. if I specified a PartitionKey = "7", it did not necessarily write to partition "7".
Reading up shows that some sort of hashing is involved, but I don't particularly want to guess randomly at 16 strings that hash to the numbers 0-15. So I'm wondering how I can define which partition is published to?
For added reference, this is the tutorial I followed to get my simplest case working.
The number of partitions is specified at the time of creating an event hub. It must be between 1 and the maximum partition count allowed for each pricing tier. For the partition count limit for each tier, see this article.
A partition is an ordered sequence of events that is held in an event hub. As newer events arrive, they are added to the end of this sequence. A partition can be thought of as a “commit log.” Event Hubs retains data for a configured retention time that applies across all partitions in the event hub.
Azure Event Hubs is a big data streaming platform and event ingestion service. It can receive and process millions of events per second. Data sent to an event hub can be transformed and stored by using any real-time analytics provider or batching/storage adapters.
Specifying a PartitionKey ensures that all events with the same key are sent to the same partition, and that order is maintained for those events within the partition.
Do you have such a requirement for your data on the processing side?
If you don't, then the recommendation is to not set the PartitionKey. That way the Event Hubs broker will distribute the events evenly among the partitions.
If you do need ordering guarantees for your data within a PartitionKey and you have a small number of publishers, then there is a manual way of handling the partitions and distributing load: the Partitioned Sender.
Refer to this link on how to use the Partitioned Sender: http://msdn.microsoft.com/en-us/library/microsoft.servicebus.messaging.eventhubclient.createpartitionedsender.aspx
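As a rough illustration of the Partitioned Sender route, something along these lines should publish to one specific partition. Treat it as a sketch, not a tested implementation: it assumes the WindowsAzure.ServiceBus NuGet package (the Microsoft.ServiceBus.Messaging namespace), the event hub name from the question, and a connection string read from an environment variable whose name, EVENTHUB_CONNECTION_STRING, is made up for this example.

```csharp
using System;
using System.Text;
using Microsoft.ServiceBus.Messaging;

class PartitionedSenderSketch
{
    static void Main()
    {
        // Hypothetical variable name; supply your own connection string here.
        // If it is missing, the client creation below will throw.
        string connectionString =
            Environment.GetEnvironmentVariable("EVENTHUB_CONNECTION_STRING");

        var client = EventHubClient.CreateFromConnectionString(
            connectionString, "de-analytics-events");

        // Bind a sender to one partition. The ID is one of the strings
        // the consumer sees in eventHub.PartitionIds ("0" to "15").
        EventHubSender sender = client.CreatePartitionedSender("7");

        // PartitionKey is left unset: the sender already fixes the partition.
        var data = new EventData(Encoding.UTF8.GetBytes("hello partition 7"));
        sender.Send(data);

        client.Close();
    }
}
```

Every event sent through this sender goes to partition "7", so balancing load across the 16 partitions becomes your job rather than the broker's.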
You're correct, a hash is used to translate the partition key to a given partition. The question I have, then, is this: as long as the hash algorithm distributes events evenly and consistently, why should you really care which partition the message is assigned to?
Yes, you could argue that you want to know so you know who the receiver will be. But the reality is that tight coupling like this makes the solution inherently fragile. You're better off letting the service do what it needs to do to keep traffic healthy, and relying on the fact that once you receive a message with a given partition key, you're very likely to keep receiving the messages that use that key.
The bigger challenge is to ensure that the partition key strategy you use is one that will help ensure a fairly even distribution of events across the partitions (i.e., don't give 10,000 devices all the same partition key).
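To get a feel for why the key strategy matters, here is a small self-contained sketch. The hash is 32-bit FNV-1a, which I picked purely as a stand-in; it is not the algorithm Event Hubs uses internally, so the partition numbers it produces will not match the real service. The key names ("all-devices", "device-0", and so on) are made up as well.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

static class PartitionKeySketch
{
    // Stand-in hash (32-bit FNV-1a). NOT the hash Event Hubs really uses.
    public static int ToPartition(string partitionKey, int partitionCount)
    {
        uint hash = 2166136261;
        unchecked
        {
            foreach (byte b in Encoding.UTF8.GetBytes(partitionKey))
            {
                hash ^= b;
                hash *= 16777619;
            }
        }
        return (int)(hash % (uint)partitionCount);
    }

    // Counts how many of the given keys land on each partition.
    public static int[] CountPerPartition(IEnumerable<string> keys, int partitionCount)
    {
        var counts = new int[partitionCount];
        foreach (var key in keys)
        {
            counts[ToPartition(key, partitionCount)]++;
        }
        return counts;
    }

    static void Main()
    {
        const int partitions = 16;

        // 10,000 devices that all share one key: everything hits one partition.
        var sharedKey = Enumerable.Repeat("all-devices", 10000);
        // One key per device: events spread over the partitions.
        var perDevice = Enumerable.Range(0, 10000).Select(i => "device-" + i);

        Console.WriteLine("shared key: " +
            string.Join(",", CountPerPartition(sharedKey, partitions)));
        Console.WriteLine("per device: " +
            string.Join(",", CountPerPartition(perDevice, partitions)));
    }
}
```

The same key always maps to the same partition, which is what gives you per-key ordering, but a single shared key funnels all traffic into one partition no matter how good the hash is.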