 

How to use ExplicitHashKey for round robin stream assignment in AWS Kinesis

I am trying to pump a lot of data through Amazon Kinesis (on the order of 10,000 points per second).

In order to maximize records per second through my shards, I'd like to round robin my requests over the shards (my application logic doesn't care what shard individual messages go to).

It seems I could do this with the ExplicitHashKey parameter on each message in the list I send to the PutRecords endpoint. However, the Amazon documentation doesn't actually describe how to use ExplicitHashKey, other than this oracular statement:

http://docs.aws.amazon.com/kinesis/latest/APIReference/API_PutRecords.html

Each record in the Records array may include an optional parameter, ExplicitHashKey, which overrides the partition key to shard mapping. This parameter allows a data producer to determine explicitly the shard where the record is stored. For more information, see Adding Multiple Records with PutRecords in the Amazon Kinesis Streams Developer Guide.

(The statement in the docs above has a link to another section of the documentation, which does not discuss ExplicitHashKeys at all).

Is there a way to use ExplicitHashKey to round robin data among shards?

What are valid values for the parameter?

asked Jun 16 '17 16:06 by deadcode




1 Answer

The hash key space of a stream is the set of 128-bit integers from 0 to 2^128 - 1. Each shard is assigned a contiguous range of integers within that space.

You can find the range of integers assigned to each shard in a stream via the AWS CLI:

aws kinesis describe-stream --stream-name name-of-your-stream

The output will look like:

{
    "StreamDescription": {
        "RetentionPeriodHours": 24, 
        "StreamStatus": "ACTIVE", 
        "StreamName": "name-of-your-stream", 
        "StreamARN": "arn:aws:kinesis:us-west-2:your-stream-info", 
        "Shards": [
            {
                "ShardId": "shardId-000000000113", 
                "HashKeyRange": {
                    "EndingHashKey": "14794885518301672324494548149207313541", 
                    "StartingHashKey": "0"
                }, 
                "ParentShardId": "shardId-000000000061", 
                "SequenceNumberRange": {
                    "StartingSequenceNumber": "49574208032121771421311268772132530603758174814974510866"
                }
            }, 
            { ... more shards ... }
        ...

You can set the ExplicitHashKey of a record to the decimal string representation of any integer within a shard's hash key range to force the record to be sent to that particular shard.

Note that because of prior merge and split operations on your stream, the output may list many shards with overlapping HashKeyRanges. The currently open shards are the ones that do not have a SequenceNumberRange.EndingSequenceNumber element.
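As a rough sketch of that filtering step in Python: the function below takes the value under "StreamDescription" from the parsed describe-stream output and keeps only the shards without an EndingSequenceNumber. The function name and the abbreviated example data (shard ids, tiny hash key ranges) are made up for illustration and are not from a real stream.

```python
def open_shards(stream_description):
    """Return the shards that have no SequenceNumberRange.EndingSequenceNumber."""
    return [
        shard
        for shard in stream_description["Shards"]
        if "EndingSequenceNumber" not in shard["SequenceNumberRange"]
    ]


# Hypothetical, abbreviated StreamDescription: one closed parent shard
# and the two open child shards produced by splitting it.
example_description = {
    "Shards": [
        {
            "ShardId": "shardId-000000000000",
            "HashKeyRange": {"StartingHashKey": "0", "EndingHashKey": "99"},
            "SequenceNumberRange": {
                "StartingSequenceNumber": "1",
                "EndingSequenceNumber": "5",
            },
        },
        {
            "ShardId": "shardId-000000000001",
            "ParentShardId": "shardId-000000000000",
            "HashKeyRange": {"StartingHashKey": "0", "EndingHashKey": "49"},
            "SequenceNumberRange": {"StartingSequenceNumber": "6"},
        },
        {
            "ShardId": "shardId-000000000002",
            "ParentShardId": "shardId-000000000000",
            "HashKeyRange": {"StartingHashKey": "50", "EndingHashKey": "99"},
            "SequenceNumberRange": {"StartingSequenceNumber": "6"},
        },
    ]
}

# Only the two child shards remain; the closed parent is dropped.
print([shard["ShardId"] for shard in open_shards(example_description)])
```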

You can round robin requests among a set of shards by identifying a 128-bit integer within the range of each shard of interest, then assigning the string representations of those integers to successive records' ExplicitHashKey in rotation.
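A minimal Python sketch of this round robin, assuming you already have the list of open shards from describe-stream. It uses each shard's StartingHashKey as the representative integer (any value in the range would do) and builds entries in the shape the PutRecords Records list expects. The helper names and the placeholder partition key are hypothetical. PartitionKey is still a required field on each entry even when ExplicitHashKey overrides the shard mapping.

```python
import itertools


def representative_hash_keys(shards):
    """Pick one hash key (as a decimal string) inside each shard's range.

    The starting key is used here for simplicity; it is already a decimal string.
    """
    return [shard["HashKeyRange"]["StartingHashKey"] for shard in shards]


def build_records(payloads, hash_keys, partition_key="placeholder"):
    """Build PutRecords entries, cycling ExplicitHashKey over hash_keys."""
    rotation = itertools.cycle(hash_keys)
    return [
        {
            "Data": payload,
            "PartitionKey": partition_key,  # required, but overridden by ExplicitHashKey
            "ExplicitHashKey": next(rotation),
        }
        for payload in payloads
    ]


# Hypothetical open shards with tiny ranges, for illustration only.
shards = [
    {"ShardId": "shardId-000000000001",
     "HashKeyRange": {"StartingHashKey": "0", "EndingHashKey": "49"}},
    {"ShardId": "shardId-000000000002",
     "HashKeyRange": {"StartingHashKey": "50", "EndingHashKey": "99"}},
]

records = build_records([b"a", b"b", b"c"], representative_hash_keys(shards))
for record in records:
    print(record["Data"], record["ExplicitHashKey"])
```

With boto3, a list built this way could be passed as the Records argument of the Kinesis client's put_records call, along with StreamName.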

As a side note, you can also calculate the hash value a given PartitionKey will evaluate to by:

  1. Compute the MD5 sum of the partition key.
  2. Interpret the MD5 sum as a hexadecimal number and convert it to base 10. This will be the hash key for that partition key. You can then look up which shard that hash key falls into.
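The two steps above can be sketched in Python as follows. This assumes the partition key is hashed as its UTF-8 bytes, which is my assumption rather than something stated above. The function names and the two-shard layout in the example are hypothetical.

```python
import hashlib


def partition_key_to_hash_key(partition_key):
    """Return the 128-bit hash key (as an int) for a partition key string."""
    # Assumption: the key is encoded as UTF-8 before hashing.
    digest = hashlib.md5(partition_key.encode("utf-8")).hexdigest()
    return int(digest, 16)


def shard_for_hash_key(hash_key, shards):
    """Return the ShardId whose HashKeyRange contains hash_key, or None."""
    for shard in shards:
        key_range = shard["HashKeyRange"]
        start = int(key_range["StartingHashKey"])
        end = int(key_range["EndingHashKey"])
        if start <= hash_key <= end:
            return shard["ShardId"]
    return None


# Hypothetical open shards that split the hash key space in half.
shards = [
    {"ShardId": "shardId-000000000001",
     "HashKeyRange": {"StartingHashKey": "0",
                      "EndingHashKey": str(2**127 - 1)}},
    {"ShardId": "shardId-000000000002",
     "HashKeyRange": {"StartingHashKey": str(2**127),
                      "EndingHashKey": str(2**128 - 1)}},
]

hash_key = partition_key_to_hash_key("my-partition-key")
print(hash_key, shard_for_hash_key(hash_key, shards))
```

Pass only the open shards to the lookup; closed parent shards have overlapping ranges and would otherwise match first.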
answered Jan 04 '23 04:01 by deadcode