How to do basic aggregation with DynamoDB?

Tags:

nosql

amazon-dynamodb

nosql-aggregation

amazon-dynamodb-streams

How is aggregation achieved with DynamoDB? MongoDB and Couchbase have map-reduce support.

Let's say we are building a tech blog where users can post articles, and articles can be tagged.

    user {
        id : 1235,
        name : "John",
        ...
    }

    article {
        id : 789,
        title: "dynamodb use cases",
        author : 12345 //userid
        tags : ["dynamodb","aws","nosql","document database"]
    }

In the user interface we want to show the current user's tags and their respective counts.

How to achieve the following aggregation?

    {
        userid : 12,
        tag_stats:{
            "dynamodb" : 3,
            "nosql" : 8
        }
    }

We will provide this data through a REST API that will be called frequently, since this information is shown on the app's main page.

- I could extract all documents and aggregate at the application level, but I suspect that would exhaust my read capacity units.
- I could use tools like EMR, Redshift, BigQuery, or AWS Lambda, but I think these are meant for data warehousing.

I would like to know other, better ways of achieving this. How do people who have chosen DynamoDB as their primary data store handle simple dynamic queries like these, considering cost and response time?

asked May 24 '17 06:05

prem kumar

2 Answers

Long story short: DynamoDB does not support this. It is not built for this use case. It is intended for quick, low-latency data access, and it simply does not provide any aggregation functionality.

You have three main options:

- Export the DynamoDB data to Redshift or EMR Hive. You can then run SQL queries on the exported copy. The benefit of this approach is that it consumes RCUs only once, but you will be working with stale data.
- Use the DynamoDB connector for Hive and query DynamoDB directly. Again you can write arbitrary SQL queries, but in this case they access the data in DynamoDB directly. The downside is that every query consumes read capacity.
- Maintain aggregated data in a separate table using DynamoDB Streams. For example, you can have a table with UserId as the partition key and a nested map of tags and counts as an attribute. On every update to your original data, DynamoDB Streams will trigger a Lambda function (or some code on your own hosts) that updates the aggregate table. This is the most cost-efficient method, but you will need to write additional code for each new query.
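
A minimal Python sketch of the write path for the third option. It assumes a hypothetical aggregate table keyed by `userId` with a `tag_stats` map attribute, and that the item already exists with a (possibly empty) `tag_stats` map, because DynamoDB rejects a `SET` on a nested path whose parent map is missing. The helper name `build_tag_increment` is made up for illustration; it only builds the arguments you would pass to boto3's `Table.update_item`.

```python
def build_tag_increment(user_id, tag, delta=1):
    """Build keyword arguments for an atomic counter update on tag_stats[tag].

    Assumptions (not from the original answer): the aggregate table uses
    `userId` as its partition key and the item already has a `tag_stats` map.
    """
    return {
        "Key": {"userId": user_id},
        # if_not_exists() starts the counter at zero the first time a tag is seen.
        "UpdateExpression": (
            "SET tag_stats.#tag = if_not_exists(tag_stats.#tag, :zero) + :delta"
        ),
        # A placeholder is used because tags may contain spaces or dots.
        "ExpressionAttributeNames": {"#tag": tag},
        "ExpressionAttributeValues": {":zero": 0, ":delta": delta},
    }


params = build_tag_increment(12, "document database")
# With boto3 this would be used roughly as:
#   boto3.resource("dynamodb").Table("user-stats").update_item(**params)
```

Reading the stats back for the REST API is then a single key lookup on `userId`, so the read cost no longer grows with the number of articles a user has written.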

Of course you can extract the data at the application level and aggregate it there, but I would not recommend it. Unless your table is small, you will need to think about throttling, about using only part of your provisioned capacity (you want to consume, say, 20% of your RCUs for aggregation and not 100%), and about how to distribute the work among multiple workers.

Both Redshift and Hive already know how to do this. Redshift relies on multiple worker nodes when it executes a query, while Hive is built on top of MapReduce. Also, both Redshift and Hive can be limited to a predefined percentage of your RCU throughput.
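
For a small table, the application-level approach could look roughly like the sketch below. It is only an illustration: `count_tags_for_user` and `pause_seconds` are hypothetical helpers, the items are plain dicts shaped like the `article` example in the question, and the pacing rule (sleep long enough that average consumption stays at a fixed share of provisioned RCUs) is one simple way to do the throttling described above, not the only one.

```python
from collections import Counter


def count_tags_for_user(articles, user_id):
    """Count how often each tag appears across one user's articles."""
    counts = Counter()
    for article in articles:
        if article.get("author") == user_id:
            counts.update(article.get("tags", []))
    return dict(counts)


def pause_seconds(consumed_rcu, provisioned_rcu, share=0.2):
    """Seconds to wait after reading a page so that average consumption
    stays at `share` of the table's provisioned read capacity."""
    budget_per_second = provisioned_rcu * share
    return consumed_rcu / budget_per_second


# Items as they might come back from a scan, one page at a time.
page = [
    {"id": 789, "author": 12, "tags": ["dynamodb", "aws", "nosql"]},
    {"id": 790, "author": 12, "tags": ["nosql"]},
    {"id": 791, "author": 99, "tags": ["dynamodb"]},
]
stats = count_tags_for_user(page, 12)
wait = pause_seconds(consumed_rcu=50, provisioned_rcu=100, share=0.2)
```

Every request still has to read all of the user's articles (or the whole table, without a suitable index), which is why this only makes sense for small data sets.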

answered Sep 30 '22 09:09

Ivan Mushketyk

DynamoDB is a pure key/value store and does not support aggregation out of the box.

If you really want to do aggregation using DynamoDB, here are some hints.

For your particular case, let's have a table named articles. To do the aggregation we need an extra table, user-stats, holding userId and tag_stats.

1. Enable DynamoDB Streams on the articles table.
2. Create a new Lambda function, user-stats-aggregate, that is subscribed to the articles DynamoDB stream and receives both the old and new images (stream view type NEW_AND_OLD_IMAGES) on every create/update/delete operation on the articles table.
3. The Lambda performs the following logic:
   - If there is no old image, take the current tags and increase the count of each one by 1 for this user. (Keep in mind that there may be no initial record in user-stats for this user.)
   - If there is an old image, check which tags were added or removed and apply +1 or -1 to each affected tag for that user.
4. Stand up an API service that retrieves these user stats.

Aggregation in DynamoDB is usually done with DynamoDB Streams, Lambda functions that perform the aggregation, and extra tables that keep the aggregated results at different granularities (minutes, hours, days, years, ...).

This gives near-real-time aggregation without computing it on the fly for every request; you simply query the aggregated data.
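
To illustrate the "different granularities" idea, here is a small Python sketch that derives one bucket key per granularity from an event timestamp. The key layout (`"hour#2017-05-24T06"` and so on) and the name `bucket_keys` are assumptions made for the example; any scheme that maps a timestamp to a stable key per time window would work, and each key would get its own counter in the aggregate table.

```python
from datetime import datetime, timezone

# strftime patterns that truncate a timestamp to each granularity
BUCKET_FORMATS = {
    "minute": "%Y-%m-%dT%H:%M",
    "hour": "%Y-%m-%dT%H",
    "day": "%Y-%m-%d",
    "year": "%Y",
}


def bucket_keys(ts):
    """Map a timezone-aware timestamp to one aggregate key per granularity."""
    ts = ts.astimezone(timezone.utc)  # bucket in UTC so keys are unambiguous
    return {
        name: f"{name}#{ts.strftime(fmt)}"
        for name, fmt in BUCKET_FORMATS.items()
    }


keys = bucket_keys(datetime(2017, 5, 24, 6, 5, 30, tzinfo=timezone.utc))
```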

answered Sep 30 '22 10:09

Traycho Ivanov
