I need a slowly changing AWS DynamoDB table periodically dumped to S3 so it can be queried in Athena. It must be ensured that the data available to Athena is not far behind what is in DynamoDB (a maximum lag of 1 hour).
I am aware of the following two approaches:
Use EMR (launched from Data Pipeline) to export the entire DynamoDB table.
The advantage of this approach is that a single EMR script (run hourly) can dump compressed Parquet files to S3 that are directly queryable in Athena. The big disadvantage is that even though only a small number of records change in an hour, the entire table must be exported every time, requiring significantly higher read capacity on DynamoDB and larger EMR resources.
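For reference, the hourly full dump can be a fairly small PySpark job on EMR. The sketch below assumes the third-party spark-dynamodb connector (com.audienceproject:spark-dynamodb) is supplied via --packages; the table, region, and bucket names are placeholders.

```python
# Sketch of an hourly full-table dump on EMR, assuming the third-party
# com.audienceproject:spark-dynamodb connector is available via --packages.
# Table, region, and bucket names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ddb-full-export").getOrCreate()

# Scan the whole DynamoDB table (this is what consumes the extra read capacity).
df = (spark.read
      .format("dynamodb")
      .option("tableName", "my_table")
      .option("region", "us-east-1")
      .load())

# Overwrite the previous snapshot with compressed Parquet that Athena can query.
(df.write
   .mode("overwrite")
   .parquet("s3://my-bucket/dynamodb-exports/my_table/"))
```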
Use DynamoDB Streams to reflect any changes made in DynamoDB onto S3.
This has the advantage of not processing unchanged data, so it avoids the need for read capacity significantly higher than what normal operations require. However, a follow-up script (probably another EMR job) is needed to consolidate the per-record files generated by DynamoDB Streams; otherwise Athena's performance is severely impacted by the large number of files.
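For context, the per-record files would come from something like the following minimal Lambda sketch, assuming the function is attached to the table's stream with a view type that includes new images; the bucket and prefix are placeholders.

```python
# Sketch of a Lambda attached to the DynamoDB stream; each change record
# becomes one small S3 object, which is why a consolidation step is needed.
# Bucket and prefix are placeholders.
import json
import uuid
import boto3

s3 = boto3.client("s3")
BUCKET = "my-bucket"
PREFIX = "dynamodb-changes/my_table/"

def handler(event, context):
    # Each invocation receives a batch of change records from the stream.
    for record in event["Records"]:
        body = {
            "eventName": record["eventName"],                # INSERT / MODIFY / REMOVE
            "keys": record["dynamodb"]["Keys"],
            "newImage": record["dynamodb"].get("NewImage"),  # present only if the stream includes images
            "approximateCreationDateTime": record["dynamodb"].get("ApproximateCreationDateTime"),
        }
        # One small object per change record.
        s3.put_object(
            Bucket=BUCKET,
            Key=f"{PREFIX}{uuid.uuid4()}.json",
            Body=json.dumps(body),
        )
```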
Are there any other approaches which can do better than these?
Navigate to the Data Sources tab of the Athena console and choose the "Connect data source" button. In the first step of the wizard, select the "Query a data source" option, choose "Amazon DynamoDB", and then click Next.
The Amazon Athena DynamoDB connector enables Amazon Athena to communicate with DynamoDB so that you can query your tables with SQL.
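For example, once the connector is set up, the registered catalog can be queried with standard Athena SQL. A minimal boto3 sketch, assuming the data source was named "ddb_catalog" in the wizard; the table name and output location are placeholders.

```python
# Sketch of querying DynamoDB through the Athena connector with boto3.
# "ddb_catalog" is whatever data source name was chosen in the wizard;
# table, database, and output location are placeholders.
import boto3

athena = boto3.client("athena")

response = athena.start_query_execution(
    QueryString='SELECT * FROM "ddb_catalog"."default"."my_table" LIMIT 10',
    QueryExecutionContext={"Catalog": "ddb_catalog", "Database": "default"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
print(response["QueryExecutionId"])
```

Keep in mind that queries through the connector read DynamoDB at query time, so they still consume read capacity on the table.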
The DynamoDB export feature allows exporting table data to Amazon S3 across AWS accounts and AWS Regions. After the data is uploaded to Amazon S3, AWS Glue can read this data and write it to the target table.
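If you go the export route, a minimal boto3 sketch of triggering a native export is below; it requires point-in-time recovery to be enabled on the table, and the ARN, bucket, and prefix are placeholders.

```python
# Sketch of triggering a native DynamoDB export to S3 with boto3.
# Requires point-in-time recovery (PITR) to be enabled on the table.
# ARN, bucket, and prefix are placeholders.
import boto3

dynamodb = boto3.client("dynamodb")

response = dynamodb.export_table_to_point_in_time(
    TableArn="arn:aws:dynamodb:us-east-1:123456789012:table/my_table",
    S3Bucket="my-bucket",
    S3Prefix="dynamodb-exports/my_table/",
    ExportFormat="DYNAMODB_JSON",
)
print(response["ExportDescription"]["ExportStatus"])  # IN_PROGRESS until the export completes
```

The export reads from the table's point-in-time backup rather than the live table, so it does not consume read capacity.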
I think the best solution from a performance/cost perspective would be to use DynamoDB Streams and a Glue Job to consolidate the files once a day (or once a week, depending on your data's velocity).
One downside of the DynamoDB Streams approach (and of any solution that reads data incrementally) is that you have to handle the complexity of updating/deleting records in a Parquet file.
If your workload is not exclusively appending new data to the table, you should record every updated/deleted item somewhere (probably a DynamoDB table or an S3 file) and let Glue remove those records before writing the consolidated file to S3.
All the actors will be:
a Lambda processing the stream that should: write each inserted/modified item to S3 as a small file and keep track of the keys of updated/deleted items so they can be reconciled later;
a Glue Job running less frequently that should: consolidate the small per-record files into larger Parquet files, dropping superseded and deleted records, and write the result to the S3 location that Athena queries (roughly as sketched below).
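As a rough illustration of the Glue side, here is a minimal PySpark sketch. It only dedupes the change files and drops deletes; flattening the DynamoDB JSON and merging with the previous consolidated snapshot are omitted, and all paths and column names ("pk", "approximateCreationDateTime", "eventName") are placeholder assumptions.

```python
# Sketch of a Glue (PySpark) consolidation job: read the small per-record
# JSON files written by the stream Lambda, keep only the latest change per
# item, drop deleted items, and rewrite a compact Parquet snapshot for Athena.
# Paths and column names are placeholders.
from awsglue.context import GlueContext
from pyspark.context import SparkContext
from pyspark.sql import Window
from pyspark.sql import functions as F

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Read the small per-record JSON files written by the stream Lambda.
changes = spark.read.json("s3://my-bucket/dynamodb-changes/my_table/")

# Keep only the latest change per item; "pk" stands in for the item key
# (in practice it has to be extracted from the nested DynamoDB "Keys" map).
latest = (
    changes
    .withColumn(
        "rn",
        F.row_number().over(
            Window.partitionBy("pk")
                  .orderBy(F.col("approximateCreationDateTime").desc())
        ),
    )
    .filter("rn = 1")
    .drop("rn")
)

# Drop items whose latest event was a delete, then rewrite a compact
# Parquet snapshot in the location Athena queries.
(
    latest
    .filter(F.col("eventName") != "REMOVE")
    .write.mode("overwrite")
    .parquet("s3://my-bucket/dynamodb-consolidated/my_table/")
)
```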
This is considerably more effort than using EMR to dump the full table hourly: you should judge for yourself whether it is worth it. :)