What are shards and partition keys in a Kinesis data stream? I read the AWS documentation but I don't get it. Can someone explain them in simple terms?
A shard is a uniquely identified sequence of data records in a stream. A stream is composed of one or more shards, each of which provides a fixed unit of capacity.
So a shard contains a set of data records. The maximum size of a data record is 1 MB, and a shard can accept up to 1,000 records per second for writes.
A shard iterator specifies the shard position from which to start reading data records sequentially. The position is specified using the sequence number of a data record in a shard.
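To make the shard iterator idea concrete, here is a minimal sketch of reading from a shard starting at a known sequence number. It assumes `client` is a boto3 Kinesis client (for example `boto3.client("kinesis")`); the function name and its arguments are made up for illustration.

```python
def read_from_sequence_number(client, stream_name, shard_id, sequence_number, limit=100):
    """Read up to `limit` records from one shard, starting at a sequence number.

    `client` is assumed to be a boto3 Kinesis client; the stream name and
    shard ID are whatever your own stream uses.
    """
    # Ask Kinesis for an iterator positioned at the given record.
    iterator = client.get_shard_iterator(
        StreamName=stream_name,
        ShardId=shard_id,
        ShardIteratorType="AT_SEQUENCE_NUMBER",
        StartingSequenceNumber=sequence_number,
    )["ShardIterator"]

    # Read records sequentially from that position.
    response = client.get_records(ShardIterator=iterator, Limit=limit)

    # NextShardIterator is what you would pass to the next get_records call.
    return response["Records"], response.get("NextShardIterator")
```

To keep reading, call `get_records` again with the returned next iterator. Other iterator types such as `TRIM_HORIZON` (oldest available record) or `LATEST` do not need a sequence number.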
From Amazon Kinesis Data Streams Terminology and Concepts - Amazon Kinesis Data Streams:
A shard is a uniquely identified sequence of data records in a stream. A stream is composed of one or more shards, each of which provides a fixed unit of capacity. Each shard can support up to 5 transactions per second for reads, up to a maximum total data read rate of 2 MB per second and up to 1,000 records per second for writes, up to a maximum total data write rate of 1 MB per second (including partition keys). The data capacity of your stream is a function of the number of shards that you specify for the stream. The total capacity of the stream is the sum of the capacities of its shards.
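As a rough illustration of "capacity is a function of the number of shards", the sketch below estimates how many shards a workload needs using the per-shard limits quoted above. The function name and the example workload numbers are made up for illustration; it ignores the 5 reads-per-second limit and any headroom you might want.

```python
import math

# Per-shard limits taken from the documentation quote above.
WRITE_MB_PER_SEC = 1
WRITE_RECORDS_PER_SEC = 1000
READ_MB_PER_SEC = 2


def shards_needed(write_mb_per_sec, write_records_per_sec, read_mb_per_sec):
    """Smallest shard count that satisfies every per-shard limit."""
    return max(
        1,  # a stream always has at least one shard
        math.ceil(write_mb_per_sec / WRITE_MB_PER_SEC),
        math.ceil(write_records_per_sec / WRITE_RECORDS_PER_SEC),
        math.ceil(read_mb_per_sec / READ_MB_PER_SEC),
    )


# Hypothetical workload: 3.5 MB/s in, 2,500 records/s in, 7 MB/s out.
print(shards_needed(3.5, 2500, 7))  # 4
```

Whichever limit is hit first decides the shard count; here both the write throughput and the read throughput require 4 shards.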
So, a shard has two purposes: it provides a fixed amount of capacity (throughput), and it holds an ordered sequence of messages.
If your application must process all messages in order, then you can only use one shard. Think of it as a line at a bank — if there is one line, then everybody gets served in order.
However, if messages only need to be ordered within a certain subset of messages, they can be sent to separate shards. For example, multiple lines in a bank, where each line gets served in order. Or, think of buses sending GPS coordinates. Each bus sends messages to only a single shard. A shard might contain messages from multiple buses, but each bus only sends to one shard. This way, when the messages from that shard are processed, all messages from a particular bus are processed in order.
This is controlled by using a Partition Key, which identifies the source. The partition key is hashed and assigned to a shard. Thus, all messages with the same partition key will go to the same shard.
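Here is a small simulation of that mapping. Kinesis hashes the partition key with MD5 into a 128-bit integer, and each shard owns a range of those hash values. The sketch assumes the shards split the hash space into equal contiguous ranges (after resharding the real ranges can be uneven), and `shard_for` is a made-up helper name, not part of any AWS SDK.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 produces a 128-bit value


def shard_for(partition_key: str, shard_count: int) -> int:
    """Return the index of the shard that would receive this partition key.

    Assumption: shard i owns an equal-sized, contiguous slice of the hash space.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_key = int.from_bytes(digest, "big")
    range_size = HASH_SPACE // shard_count
    # Clamp so the leftover values at the very top land in the last shard.
    return min(hash_key // range_size, shard_count - 1)


# The same bus always lands on the same shard, so its messages stay in order.
for bus in ["bus-17", "bus-42", "bus-17"]:
    print(bus, "-> shard", shard_for(bus, 4))
```

The two `bus-17` lines print the same shard index, which is exactly the ordering guarantee described above.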
At the back-end, there is typically one worker per shard that is processing the messages, in order, from that shard.
If your system does not care about preserving message order, then use a random partition key. This means the message will be sent to any shard.
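The effect of a random partition key can be sketched with a quick simulation. It uses a random UUID as the key and assumes, as a simplification, that the shards split the 128-bit MD5 hash space into equal ranges; `shard_index` and `spread` are made-up names for illustration.

```python
import hashlib
import uuid
from collections import Counter


def shard_index(partition_key: str, shard_count: int) -> int:
    """Map a key to a shard, assuming equal-sized hash ranges per shard."""
    hash_key = int.from_bytes(hashlib.md5(partition_key.encode("utf-8")).digest(), "big")
    return min(hash_key // (2 ** 128 // shard_count), shard_count - 1)


def spread(shard_count: int, n_messages: int) -> Counter:
    """Count how many messages each shard receives with random partition keys."""
    counts = Counter()
    for _ in range(n_messages):
        random_key = str(uuid.uuid4())  # a fresh random key per message
        counts[shard_index(random_key, shard_count)] += 1
    return counts


print(spread(shard_count=4, n_messages=2000))
```

Each shard ends up with roughly a quarter of the messages, so the load is balanced, but consecutive messages from the same source may land on different shards and lose their relative order.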