How do DynamoDB streams distribute records to shards?

My goal is to ensure that records published by a DynamoDB stream are processed in the "correct" order. My table contains events for customers: the hash key is the event ID, and the range key is a timestamp. "Correct" order means that events for the same customer ID are processed in order; different customer IDs can be processed in parallel.

I'm consuming the stream via Lambda functions, and consumers are spawned automatically per shard. So if the runtime decides to shard the stream, consumption happens in parallel (if I understand this correctly), and I run the risk of processing a CustomerAddressChanged event before CustomerCreated (for example).

The docs imply that there is no way to influence the sharding. But they don't say so explicitly. Is there a way, e.g., by using a combination of customer ID and timestamp for the range key?
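For concreteness, a minimal sketch of the consumer setup described above; the handler name and the attribute names (EventId, Timestamp) are illustrative assumptions:

```python
def handler(event, context):
    # Lambda spawns one concurrent invocation per shard, so records within
    # a single invocation arrive in order, but separate shards are processed
    # in parallel; that parallelism is where the ordering risk comes from.
    for record in event["Records"]:
        keys = record["dynamodb"]["Keys"]
        event_id = keys["EventId"]["S"]     # hash key (event ID)
        timestamp = keys["Timestamp"]["S"]  # range key (timestamp)
        print(f"{record['eventName']}: event {event_id} at {timestamp}")
```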

asked May 30 '17 by EagleBeak

People also ask

How does DynamoDB Sharding work?

Write sharding is a mechanism for distributing writes across a DynamoDB table's partitions more evenly. It increases write throughput per partition key by spreading the write operations for a single partition key across multiple partitions.

How does DynamoDB streams work?

DynamoDB Streams captures a time-ordered sequence of item-level modifications in any DynamoDB table and stores this information in a log for up to 24 hours. Applications can access this log and view the data items as they appeared before and after they were modified, in near-real time.
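As an illustration (the table name is a placeholder), enabling such a stream with both before and after images might look like this:

```python
import boto3

dynamodb = boto3.client("dynamodb")

# NEW_AND_OLD_IMAGES stores each item as it appeared both before and
# after the modification; the stream retains records for up to 24 hours.
dynamodb.update_table(
    TableName="Events",  # placeholder table name
    StreamSpecification={
        "StreamEnabled": True,
        "StreamViewType": "NEW_AND_OLD_IMAGES",
    },
)
```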

What is shards DynamoDB stream?

Shards in DynamoDB streams are collections of stream records. Each stream record represents a single data modification in the DynamoDB table to which the stream belongs. The following diagram shows the relationship between a stream, shards in the stream, and stream records in the shards.

Does DynamoDB use sharding?

One way to better distribute writes across a partition key space in Amazon DynamoDB is to expand the space. You can do this in several different ways. You can add a random number to the partition key values to distribute the items among partitions.
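A rough sketch of that random-suffix technique (table and attribute names are assumptions). Note that write sharding deliberately spreads one logical key across partitions, so it would defeat the per-customer ordering the question is after; it is shown only to illustrate the term:

```python
import random

import boto3

dynamodb = boto3.client("dynamodb")
N_SUFFIXES = 10  # number of artificial sub-keys per logical key

def put_event(customer_id: str, timestamp: str, payload: str) -> None:
    # Append a random suffix so writes for one customer spread over
    # N_SUFFIXES partition key values; reads must query all suffixes.
    sharded_key = f"{customer_id}#{random.randrange(N_SUFFIXES)}"
    dynamodb.put_item(
        TableName="Events",  # placeholder table name
        Item={
            "CustomerId": {"S": sharded_key},
            "Timestamp": {"S": timestamp},
            "Payload": {"S": payload},
        },
    )
```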


2 Answers

The assumption that sharding is determined by table keys seems to be correct. My solution will be to use the customer ID as the hash key and the timestamp (or event ID) as the range key.

This AWS blog says:

The relative ordering of a sequence of changes made to a single primary key will be preserved within a shard. Further, a given key will be present in at most one of a set of sibling shards that are active at a given point in time. As a result, your code can simply process the stream records within a shard in order to accurately track changes to an item.

This slide confirms it. I still wish the DynamoDB docs would explicitly say so...
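A sketch of that key schema (names and billing mode are illustrative): with the customer ID as hash key, all of a customer's events live on one partition, share a shard, and therefore appear in the stream in order; the timestamp range key orders them within the customer.

```python
import boto3

dynamodb = boto3.client("dynamodb")

dynamodb.create_table(
    TableName="Events",  # placeholder table name
    KeySchema=[
        {"AttributeName": "CustomerId", "KeyType": "HASH"},  # per-customer ordering
        {"AttributeName": "Timestamp", "KeyType": "RANGE"},
    ],
    AttributeDefinitions=[
        {"AttributeName": "CustomerId", "AttributeType": "S"},
        {"AttributeName": "Timestamp", "AttributeType": "S"},
    ],
    BillingMode="PAY_PER_REQUEST",
    StreamSpecification={
        "StreamEnabled": True,
        "StreamViewType": "NEW_AND_OLD_IMAGES",
    },
)
```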

answered Sep 21 '22 by EagleBeak


I just had a response from AWS support. It seems to confirm @EagleBeak's assumption about partitions being mapped to shards. Or, as I understand it, a partition is mapped to a shard tree.

My question was about REMOVE events due to TTL expiration, but it would apply to all other types of actions too.

  1. Is a shard created per primary partition key? And if there are too many items in the same partition, does the shard get split into children?

A shard is created per partition in your DynamoDB table. If a partition split is required due to too many items in the same partition, the shard gets split into children as well. A shard might split in response to high levels of write activity on its parent table, so that applications can process records from multiple shards in parallel.

  • https://aws.amazon.com/blogs/database/dynamodb-streams-use-cases-and-design-patterns/
  • https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.html
  2. Will those 100 removed items be put in just one shard, provided they all have the same partition key?

Assuming all 100 items have the same partition key value (but different sort key values), they would have been stored on the same partition. Therefore, they would be removed from the same partition and be put in the same shard.

  3. Since "records sent to your AWS Lambda function are strictly serialized", how does this serialisation work in the case of TTL? Is order within a shard established by partition/sort keys, TTL expiration, etc.?

DynamoDB Streams captures a time-ordered sequence of item-level modifications in your DynamoDB table. This time-ordered sequence is preserved at a per-shard level. In other words, the order within a shard is established based on the order in which items were created, updated, or deleted. (A rough sketch of reading shards in that order follows this list.)

  • https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.html
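To make the parent/child relationship concrete, here is a rough boto3 sketch (mine, not from AWS support) that walks a stream's shard tree, finishing each parent shard before starting its children so a given key is never read out of order across a split:

```python
import boto3

streams = boto3.client("dynamodbstreams")

def process_shard(stream_arn: str, shard_id: str) -> None:
    # Records within a single shard come back strictly in order.
    iterator = streams.get_shard_iterator(
        StreamArn=stream_arn,
        ShardId=shard_id,
        ShardIteratorType="TRIM_HORIZON",
    )["ShardIterator"]
    while iterator:
        response = streams.get_records(ShardIterator=iterator, Limit=100)
        for record in response["Records"]:
            print(record["eventName"], record["dynamodb"]["Keys"])
        if not response["Records"]:
            break  # simplification: stop once the shard yields nothing
        iterator = response.get("NextShardIterator")

def walk_shard_tree(stream_arn: str) -> None:
    # Note: describe_stream paginates via LastEvaluatedShardId; omitted here.
    shards = streams.describe_stream(StreamArn=stream_arn)[
        "StreamDescription"
    ]["Shards"]
    shard_ids = {s["ShardId"] for s in shards}
    children, roots = {}, []
    for s in shards:
        parent = s.get("ParentShardId")
        if parent in shard_ids:
            children.setdefault(parent, []).append(s["ShardId"])
        else:
            roots.append(s["ShardId"])  # no parent, or parent already trimmed
    stack = list(roots)
    while stack:
        shard_id = stack.pop()
        process_shard(stream_arn, shard_id)       # drain the parent first...
        stack.extend(children.get(shard_id, []))  # ...then visit its children
```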
answered Sep 18 '22 by cortopy