I am trying to pump lots of data through Amazon Kinesis (on the order of 10,000 records per second).
In order to maximize records per second through my shards, I'd like to round robin my requests over the shards (my application logic doesn't care what shard individual messages go to).
It would seem I could do this with the ExplicitHashKey parameter for the messages in the list I am sending to the PutRecords endpoint - however the Amazon documentation doesn't actually describe how to use ExplicitHashKey, other than the oracular statement of:
http://docs.aws.amazon.com/kinesis/latest/APIReference/API_PutRecords.html
Each record in the Records array may include an optional parameter, ExplicitHashKey, which overrides the partition key to shard mapping. This parameter allows a data producer to determine explicitly the shard where the record is stored. For more information, see Adding Multiple Records with PutRecords in the Amazon Kinesis Streams Developer Guide.
(The statement in the docs above has a link to another section of the documentation, which does not discuss ExplicitHashKeys at all).
Is there a way to use ExplicitHashKey to round robin data among shards?
What are valid values for the parameter?
Kinesis Data Streams segregates the data records belonging to a stream into multiple shards. It uses the partition key that is associated with each data record to determine which shard a given data record belongs to. Partition keys are Unicode strings, with a maximum length limit of 256 characters for each key.
To put data into the stream, you must specify the name of the stream, a partition key, and the data blob to be added to the stream. The partition key is used to determine which shard in the stream the data record is added to. All the data in the shard is sent to the same worker that is processing the shard.
ShardId. The unique identifier of the shard within the stream. Type: String.
A Kinesis data stream is a set of shards. There can be multiple consumer applications for one stream, and each application can consume data independently and concurrently.
The hash key space is the set of 128-bit integers from 0 to 2^128 - 1, and each shard is assigned a contiguous range of it.
You may find the range of integers assigned to a given shard in a stream via the AWS CLI:
aws kinesis describe-stream --stream-name name-of-your-stream
The output will look like:
{
    "StreamDescription": {
        "RetentionPeriodHours": 24,
        "StreamStatus": "ACTIVE",
        "StreamName": "name-of-your-stream",
        "StreamARN": "arn:aws:kinesis:us-west-2:your-stream-info",
        "Shards": [
            {
                "ShardId": "shardId-000000000113",
                "HashKeyRange": {
                    "EndingHashKey": "14794885518301672324494548149207313541",
                    "StartingHashKey": "0"
                },
                "ParentShardId": "shardId-000000000061",
                "SequenceNumberRange": {
                    "StartingSequenceNumber": "49574208032121771421311268772132530603758174814974510866"
                }
            },
            { ... more shards ... }
            ...
You may set the ExplicitHashKey of a record to the decimal string representation of an integer anywhere in the range of hash keys for a shard to force the record to be sent to that particular shard.
Note that due to prior merge and split operations on your stream, the output may list many shards with overlapping HashKeyRanges. The currently open shards are the ones that do not have a SequenceNumberRange.EndingSequenceNumber element.
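As a rough sketch of that filtering step, the Python below takes the "StreamDescription" object from the describe-stream output (already parsed into a dict) and keeps only the open shards. The function name open_shards and the sample data are made up for illustration; only the field names come from the output shown above.

```python
def open_shards(stream_description):
    """Return (shard_id, start, end) for every shard that is still open.

    `stream_description` is the parsed "StreamDescription" object.
    """
    result = []
    for shard in stream_description["Shards"]:
        # A closed shard has an EndingSequenceNumber; an open one does not.
        if "EndingSequenceNumber" in shard["SequenceNumberRange"]:
            continue
        key_range = shard["HashKeyRange"]
        # Hash keys arrive as decimal strings; Python ints hold 128 bits fine.
        start = int(key_range["StartingHashKey"])
        end = int(key_range["EndingHashKey"])
        result.append((shard["ShardId"], start, end))
    return result


# Invented sample: one closed parent shard and one open child shard.
sample = {
    "Shards": [
        {
            "ShardId": "shardId-000000000000",
            "HashKeyRange": {"StartingHashKey": "0", "EndingHashKey": "99"},
            "SequenceNumberRange": {
                "StartingSequenceNumber": "1",
                "EndingSequenceNumber": "2",
            },
        },
        {
            "ShardId": "shardId-000000000001",
            "HashKeyRange": {"StartingHashKey": "0", "EndingHashKey": "49"},
            "SequenceNumberRange": {"StartingSequenceNumber": "3"},
        },
    ]
}

print(open_shards(sample))
```
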
You can round robin requests among a set of shards by picking a 128-bit integer within the hash key range of each shard of interest, and assigning the string representations of those numbers in turn to each record's ExplicitHashKey.
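A minimal sketch of the round robin, assuming you already have the (start, end) hash key range of each open shard as Python integers. The helper names pick_hash_keys and build_entries are hypothetical, and the midpoint is an arbitrary choice; any value inside a shard's range would target that shard. PartitionKey is still filled in because it is a required field of each PutRecords entry, even though ExplicitHashKey overrides it for routing.

```python
from itertools import cycle


def pick_hash_keys(shard_ranges):
    """Pick one ExplicitHashKey (decimal string) inside each shard's range."""
    return [str((start + end) // 2) for start, end in shard_ranges]


def build_entries(payloads, hash_keys, partition_key="unused"):
    """Build PutRecords entries, cycling through the per-shard hash keys."""
    keys = cycle(hash_keys)
    return [
        {
            "Data": payload,
            # Required by the API, but not used for routing here.
            "PartitionKey": partition_key,
            # Overrides the partition-key-to-shard mapping.
            "ExplicitHashKey": next(keys),
        }
        for payload in payloads
    ]


# Toy ranges for illustration; real ranges span the 128-bit key space.
ranges = [(0, 9), (10, 19), (20, 29)]
entries = build_entries([b"a", b"b", b"c", b"d"], pick_hash_keys(ranges))
for entry in entries:
    print(entry["ExplicitHashKey"], entry["Data"])
```

The resulting list has the shape of the Records array that PutRecords expects, so with boto3 it could presumably be passed as the Records argument of the Kinesis client's put_records call, along with StreamName.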
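As a side note, you can also calculate the hash value that a given PartitionKey will evaluate to by taking the MD5 hash of the partition key and interpreting the 128-bit result as an integer (written in base 10 when comparing it against a HashKeyRange).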
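A short sketch of that calculation, based on the MD5 mapping from partition keys to 128-bit integers that the Kinesis documentation describes. UTF-8 encoding of the key is an assumption here, and the function name is hypothetical.

```python
import hashlib


def partition_key_to_hash_key(partition_key):
    """Return the 128-bit integer that the MD5 of a partition key maps to."""
    # Assumption: the key is hashed as its UTF-8 byte encoding.
    digest = hashlib.md5(partition_key.encode("utf-8")).hexdigest()
    # Read the 32 hex digits as one unsigned integer.
    return int(digest, 16)


hash_key = partition_key_to_hash_key("my-partition-key")
# Print in base 10, the same form as StartingHashKey / EndingHashKey.
print(str(hash_key))
```

Comparing the result against each open shard's StartingHashKey and EndingHashKey (both inclusive) tells you which shard the key would land on.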