I need a slowly changing AWS DynamoDB table periodically dumped to S3 so it can be queried in Athena. It must be ensured that the data available to Athena is not far behind what is in DynamoDB (a maximum lag of 1 hour).
I am aware of the following two approaches:
Use EMR (launched from Data Pipeline) to export the entire DynamoDB table.
The advantage of this approach is that a single EMR script (run hourly) can dump compressed Parquet files to S3 that are directly queryable in Athena. The big disadvantage is that even though only a small number of records change in an hour, the entire table must be exported every time, requiring significantly higher read capacity on DynamoDB and larger EMR resources.
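For reference, the hourly full dump can be a fairly small PySpark job on EMR. The sketch below assumes the third-party spark-dynamodb connector (com.audienceproject:spark-dynamodb) is supplied via --packages; the table, region, and bucket names are placeholders.

```python
# Sketch of an hourly full-table dump on EMR, assuming the third-party
# com.audienceproject:spark-dynamodb connector is available via --packages.
# Table, region, and bucket names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ddb-full-export").getOrCreate()

# Scan the whole DynamoDB table (this is what consumes the extra read capacity).
df = (spark.read
      .format("dynamodb")
      .option("tableName", "my_table")
      .option("region", "us-east-1")
      .load())

# Overwrite the previous snapshot with compressed Parquet that Athena can query.
(df.write
   .mode("overwrite")
   .parquet("s3://my-bucket/dynamodb-exports/my_table/"))
```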
Use DynamoDB Streams to reflect any changes made in DynamoDB onto S3.
This has the advantage of not processing unchanged data, so it avoids the need for read capacity significantly higher than what normal operations require. However, a follow-up script (probably another EMR job) is needed to consolidate the per-record files generated by DynamoDB Streams; otherwise Athena's performance is severely impacted by the large number of files.
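For context, the per-record files would come from something like the following minimal Lambda sketch, assuming the function is attached to the table's stream with a view type that includes new images; the bucket and prefix are placeholders.

```python
# Sketch of a Lambda attached to the DynamoDB stream; each change record
# becomes one small S3 object, which is why a consolidation step is needed.
# Bucket and prefix are placeholders.
import json
import uuid
import boto3

s3 = boto3.client("s3")
BUCKET = "my-bucket"
PREFIX = "dynamodb-changes/my_table/"

def handler(event, context):
    # Each invocation receives a batch of change records from the stream.
    for record in event["Records"]:
        body = {
            "eventName": record["eventName"],                # INSERT / MODIFY / REMOVE
            "keys": record["dynamodb"]["Keys"],
            "newImage": record["dynamodb"].get("NewImage"),  # present only if the stream includes images
            "approximateCreationDateTime": record["dynamodb"].get("ApproximateCreationDateTime"),
        }
        # One small object per change record.
        s3.put_object(
            Bucket=BUCKET,
            Key=f"{PREFIX}{uuid.uuid4()}.json",
            Body=json.dumps(body),
        )
```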
Are there any other approaches which can do better than these?
Navigate to the Data Sources tab of the Athena console and choose the "Connect data source" button. In the first step of the wizard, select the "Query a data source" option, choose "Amazon DynamoDB", and then click Next.
The Amazon Athena DynamoDB connector enables Amazon Athena to communicate with DynamoDB so that you can query your tables with SQL.
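For example, once the connector is set up, the registered catalog can be queried with standard Athena SQL. A minimal boto3 sketch, assuming the data source was named "ddb_catalog" in the wizard; the table name and output location are placeholders.

```python
# Sketch of querying DynamoDB through the Athena connector with boto3.
# "ddb_catalog" is whatever data source name was chosen in the wizard;
# table, database, and output location are placeholders.
import boto3

athena = boto3.client("athena")

response = athena.start_query_execution(
    QueryString='SELECT * FROM "ddb_catalog"."default"."my_table" LIMIT 10',
    QueryExecutionContext={"Catalog": "ddb_catalog", "Database": "default"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
print(response["QueryExecutionId"])
```

Keep in mind that queries through the connector read DynamoDB at query time, so they still consume read capacity on the table.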
The DynamoDB export feature allows exporting table data to Amazon S3 across AWS accounts and AWS Regions. After the data is uploaded to Amazon S3, AWS Glue can read this data and write it to the target table.
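If you go the export route, a minimal boto3 sketch of triggering a native export is below; it requires point-in-time recovery to be enabled on the table, and the ARN, bucket, and prefix are placeholders.

```python
# Sketch of triggering a native DynamoDB export to S3 with boto3.
# Requires point-in-time recovery (PITR) to be enabled on the table.
# ARN, bucket, and prefix are placeholders.
import boto3

dynamodb = boto3.client("dynamodb")

response = dynamodb.export_table_to_point_in_time(
    TableArn="arn:aws:dynamodb:us-east-1:123456789012:table/my_table",
    S3Bucket="my-bucket",
    S3Prefix="dynamodb-exports/my_table/",
    ExportFormat="DYNAMODB_JSON",
)
print(response["ExportDescription"]["ExportStatus"])  # IN_PROGRESS until the export completes
```

The export reads from the table's point-in-time backup rather than the live table, so it does not consume read capacity.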
I think the best solution from a performance/cost perspective would be to use DynamoDB Streams and a Glue Job to consolidate the files once a day (or once a week, depending on your data's velocity).
One downside of the DynamoDB Streams approach (and of any solution that reads data incrementally) is that you have to handle the complexity of updating/deleting records in a Parquet file.
If your workload is not exclusively appending new data to the table, you should record every updated/deleted item somewhere (probably a DynamoDB table or an S3 file) and let Glue remove those records before writing the consolidated file to S3.
All the actors will be:
a Lambda processing the stream that should: write each inserted/modified item to S3 as a small file and keep track of the keys of updated/deleted items so they can be reconciled later;
a Glue Job running less frequently that should: consolidate the small per-record files into larger Parquet files, dropping superseded and deleted records, and write the result to the S3 location that Athena queries (roughly as sketched below).
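As a rough illustration of the Glue side, here is a minimal PySpark sketch. It only dedupes the change files and drops deletes; flattening the DynamoDB JSON and merging with the previous consolidated snapshot are omitted, and all paths and column names ("pk", "approximateCreationDateTime", "eventName") are placeholder assumptions.

```python
# Sketch of a Glue (PySpark) consolidation job: read the small per-record
# JSON files written by the stream Lambda, keep only the latest change per
# item, drop deleted items, and rewrite a compact Parquet snapshot for Athena.
# Paths and column names are placeholders.
from awsglue.context import GlueContext
from pyspark.context import SparkContext
from pyspark.sql import Window
from pyspark.sql import functions as F

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Read the small per-record JSON files written by the stream Lambda.
changes = spark.read.json("s3://my-bucket/dynamodb-changes/my_table/")

# Keep only the latest change per item; "pk" stands in for the item key
# (in practice it has to be extracted from the nested DynamoDB "Keys" map).
latest = (
    changes
    .withColumn(
        "rn",
        F.row_number().over(
            Window.partitionBy("pk")
                  .orderBy(F.col("approximateCreationDateTime").desc())
        ),
    )
    .filter("rn = 1")
    .drop("rn")
)

# Drop items whose latest event was a delete, then rewrite a compact
# Parquet snapshot in the location Athena queries.
(
    latest
    .filter(F.col("eventName") != "REMOVE")
    .write.mode("overwrite")
    .parquet("s3://my-bucket/dynamodb-consolidated/my_table/")
)
```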
This is considerably more effort than using EMR to dump the full table hourly: you should judge for yourself whether it is worth it. :)