In a producer-consumer web application, what should be the thought process to create a partition key for a kinesis stream shard. Suppose, I have a kinesis stream with 16 shards, how many partition keys should I create? Is it really dependent on the number of shards?
A partition key is used to group data by shard within a stream. Kinesis Data Streams segregates the data records belonging to a stream into multiple shards. It uses the partition key that is associated with each data record to determine which shard a given data record belongs to.
The throughput of a Kinesis data stream is designed to scale without limits. The default shard quota is 500 shards per stream for the following AWS Regions: US East (N. Virginia), US West (Oregon), and Europe (Ireland). For all other Regions, the default shard quota is 200 shards per stream.
Currently, you scale an Amazon Kinesis Data Stream shard programmatically. Alternatively, you can use the Amazon Kinesis Scaling Utilities. To do so, you can use each utility manually, or automated with an AWS Elastic Beanstalk environment.
Each shard can support up to a maximum total data read rate of 2 MB per second via GetRecords. If a call to GetRecords returns 10 MB, subsequent calls made within the next 5 seconds throw an exception.
Partition (or Hash) Key: starts from 1 up to 340282366920938463463374607431768211455. Lets say ~34020 * 10^34, I will omit 10^34 for ease...
If you have 30 shards, uniformly divided, each should cover 1134 * 10^34 hash keys. The coverage should be like this.
Shard-00: 0 - 1134
Shard-01: 1135 - 2268
Shard-03: 2269 - 3402
Shard-04: 3403 - 4536
...
Shard-28: 30619 - 31752
Shard-29: 31753 - 32886
Shard-30: 32887 - 34020
And if you have 3 consumer applications (listening to these 30 shards) each should listen 10 shards (optimum balanced).
This also explains Merge and Split operations on a Stream.
Shard-31: 0 - 567
Shard-32: 568 - 1134
Shard-01: 1135 - 2268
Shard-03: 2269 - 3402
Shard-04: 3403 - 4536
...
Shard-28: 30619 - 31752
Shard-29: 31753 - 32886
Shard-30: 32887 - 34020
See, Shard-00 will no longer accept new data. The new records that are put in Kinesis stream with the same partition key range (as Shard-00) will be placed under Shard-31 or Shard-32.
While sending data to Kinesis (ie. producer side), you should not worry about "which shard the data goes to". Sending a random number (or uuid, or current timestamp in millis) would be best for scaling and distributing the data effectively on shards. Unless you are worried about the ordering of records in a single shard, it is best to choose a random number/constantly changing partition key for put_record request.
In Java you can use "putRecordsRequestEntry.setPartitionKey(Long.toString(System.currentTimeMillis()))
" or "putRecordRequest.setPartitionKey(Long.toString(System.currentTimeMillis()))
" can be examples.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With