Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is partition key in AWS Kinesis all about?

I was reading about AWS Kinesis. In the following program, I write data into the stream named TestStream. I ran this piece of code 10 times, inserting 10 records into the stream.

var params = {
    Data: 'More Sample data into the test stream ...',
    PartitionKey: 'TestKey_1',
    StreamName: 'TestStream'
};

kinesis.putRecord(params, function(err, data) {
   if (err) console.log(err, err.stack); // an error occurred
   else     console.log(data);           // successful response
});

All the records were inserted successfully. What does partition key really mean here? What is it doing in the background? I read its documentation but did not understand what it meant.

like image 361
Suhail Gupta Avatar asked Jan 23 '18 10:01

Suhail Gupta


People also ask

What is shard in Kinesis data stream?

A shard has a sequence of data records in a stream. It serves as a base throughput unit of a Kinesis data stream. A shard supports 1 MB/second and 1,000 records per second for writes and 2 MB/second for reads.

What is dynamic partitioning in AWS?

Dynamic partitioning enables you to continuously partition streaming data in Kinesis Data Firehose by using keys within data (for example, customer_id or transaction_id ) and then deliver the data grouped by these keys into corresponding Amazon Simple Storage Service (Amazon S3) prefixes.


2 Answers

Partition keys only matter when you have multiple shards in a stream (but they're required always). Kinesis computes the MD5 hash of a partition key to decide what shard to store the record on (if you describe the stream you'll see the hash range as part of the shard decription).

So why does this matter?

Each shard can only accept 1,000 records and/or 1 MB per second (see PutRecord doc). If you write to a single shard faster than this rate you'll get a ProvisionedThroughputExceededException.

With multiple shards, you scale this limit: 4 shards gives you 4,000 records and/or 4 MB per second. Of course, there are caveats.

The biggest is that you must use different partition keys. If all of your records use the same partition key then you're still writing to a single shard, because they'll all have the same hash value. How you solve this depends on your application: if you're writing from multiple processes then it might be sufficient to use the process ID, server's IP address, or hostname. If you're writing from a single process then you can either use information that's in the record (for example, a unique record ID) or generate a random string.

Second caveat is that the partition key counts against the total write size, and is stored in the stream. So while you could probably get good randomness by using some textual component in the record, you'd be wasting space. On the other hand, if you have some random textual component, you could calculate your own hash from it and then stringify that for the partition key.

Lastly, if you're using PutRecords (which you should, if you're writing a lot of data), individual records in the request may be rejected while others are accepted. This happens because those records went to a shard that was already at its write limits, and you have to re-send them (after a delay).

like image 185
kdgregory Avatar answered Oct 08 '22 23:10

kdgregory


The accepted answer explains what are partition keys and and what they're used for in Kinesis (to decide to which shard to send the data to). Unfortunately, it does not explain why partition keys are needed in the first place.

In theory AWS could create a random partition key for each record which will result a near-perfect spread.

The real reason partitions are used is for "ordering/streaming". Kinesis maintains ordering (sequence number) for each shard.

In other words, by streaming X and afterwards Y to shard Z it is guaranteed, that X will be pulled from the stream before Y (when pulling records from all shards). On the other hand, by streaming X to shard Z1 and afterwards Y to shard Z2 there is no guarantee on the ordering (when pulling records from all shards). Y may definitely be pulled before X.

The shard "streaming" capability is useful in many cases.

(E.g. a video service streaming a movie to a user using the username and the movie name as the partition key).

(E.g. working on a stream of common events, and applying aggregation).

In cases where ordering (streaming) or grouping (e.g aggregation) is not required, generating a random partition key will suffice.

like image 35
Tomer Avatar answered Oct 08 '22 21:10

Tomer