 

The right way to make DynamoDB data searchable in Athena?

I need a slowly changing AWS DynamoDB table periodically dumped to S3 so it can be queried in Athena. The data available to Athena must not lag far behind what is in DynamoDB (a maximum lag of 1 hour).

I am aware of the following two approaches:

  1. Use EMR (launched from Data Pipeline) to export the entire DynamoDB table

    The advantage of this approach is that a single EMR script (run hourly) can dump compressed Parquet files to S3 that are directly queryable in Athena. The big disadvantage is that even though only a small number of records change in an hour, the entire table has to be exported every time, which requires significantly higher read capacity in DynamoDB and larger EMR resources (a minimal sketch of what such a dump amounts to follows this list).

  2. Use DynamoDB Streams to reflect any changes in DynamoDB on S3.

    This has the advantage of not processing unchanged data, so it avoids the need for read capacity significantly above what normal operations require. However, a follow-up script (probably another EMR job) would be needed to consolidate the per-record files generated by DynamoDB Streams; otherwise Athena's performance is severely impacted by the large number of small files.
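
To make the cost trade-off of approach 1 concrete, here is a minimal single-process sketch of what the hourly full dump boils down to; an EMR (Hive/Spark) job launched from Data Pipeline would do the same thing at scale. The table name, bucket and partition layout are placeholders, and writing Parquet to S3 this way assumes pyarrow and s3fs are installed:

```python
# Minimal sketch of the "full dump" approach. Table and bucket names are
# placeholders; an EMR job would do the same thing at scale.
from datetime import datetime, timezone

import boto3
import pandas as pd

table = boto3.resource("dynamodb").Table("my-table")

# Scan the whole table. This is what makes the approach expensive: every
# item is read on every run, even if almost nothing changed in the last hour.
items = []
response = table.scan()
items.extend(response["Items"])
while "LastEvaluatedKey" in response:
    response = table.scan(ExclusiveStartKey=response["LastEvaluatedKey"])
    items.extend(response["Items"])

# Write one compressed Parquet file per run into an hourly "partition" that
# Athena can query directly. (Decimal and set attributes may need extra
# conversion in a real job.)
partition = datetime.now(timezone.utc).strftime("dt=%Y-%m-%d-%H")
pd.DataFrame(items).to_parquet(
    f"s3://my-athena-bucket/full-dump/{partition}/data.snappy.parquet",
    compression="snappy",
)
```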

Are there any other approaches which can do better than these?

asked Oct 18 '18 04:10 by siberiancrane


People also ask

How do you query DynamoDB with Athena?

Navigate to the Data Sources tab of the Athena console and choose the "Connect data source" button. In the first step of the Data Sources wizard, select the "Query a data source" option, then select "Amazon DynamoDB" and click the Next button.

Does Athena work with DynamoDB?

The Amazon Athena DynamoDB connector enables Amazon Athena to communicate with DynamoDB so that you can query your tables with SQL.

Can AWS Glue read from DynamoDB?

The DynamoDB export feature allows exporting table data to Amazon S3 across AWS accounts and AWS Regions. After the data is uploaded to Amazon S3, AWS Glue can read this data and write it to the target table.
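
As a concrete illustration of the connector answer above, here is a hedged sketch of running such a SQL query from code; the catalog name "dynamodb", database "default", table "my_table" and results bucket are placeholders chosen when the data source was registered:

```python
# Hedged example of querying a DynamoDB table through the Athena DynamoDB
# connector with boto3. Catalog, database, table and results bucket are
# placeholder names from the data source setup.
import boto3

athena = boto3.client("athena")
execution = athena.start_query_execution(
    QueryString='SELECT * FROM "dynamodb"."default"."my_table" LIMIT 10',
    QueryExecutionContext={"Catalog": "dynamodb", "Database": "default"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print(execution["QueryExecutionId"])  # poll get_query_execution for status
```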


1 Answer

I think the best solution from a performance/cost perspective would be to use DynamoDB Streams and a Glue Job to consolidate the files once a day (or once a week, depending on your data's velocity).

One downside of the DynamoDB Streams approach (and of all solutions that read data incrementally) is that you have to handle the complexity of updating/deleting records in a Parquet file.

If your workload is not exclusively appending new data to the table, you should write every updated/deleted item somewhere (probably a DynamoDB table or an S3 file) and let Glue delete those records before writing the consolidated file to S3.

The actors involved would be:

a Lambda function processing the stream (sketched after this list), which should:

  • write newly added items to Parquet files in S3,
  • write updates (even PutItem on an existing item) and deletions to a DynamoDB table;

a Glue Job, running less frequently (second sketch after this list), which should:

  • consolidate the many small files created by the Lambda into fewer, bigger Parquet files,
  • merge all update/delete operations logged in the DynamoDB table into the resulting Parquet files.
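
A minimal sketch of the stream-processing Lambda, under assumptions not spelled out above: the stream is configured to include new images, the bucket "my-athena-bucket" and the log table "stream-change-log" are placeholder names, and new items are written as one JSON object per record (a production job would batch them and write Parquet):

```python
# Hedged sketch of the stream-processing Lambda. Bucket and table names are
# placeholders; the stream is assumed to carry new images.
import json

import boto3
from boto3.dynamodb.types import TypeDeserializer

s3 = boto3.client("s3")
change_log = boto3.resource("dynamodb").Table("stream-change-log")
deserializer = TypeDeserializer()


def handler(event, context):
    for record in event["Records"]:
        if record["eventName"] == "INSERT":
            # Brand-new item: convert it out of the DynamoDB wire format and
            # write it to S3. (A production job would batch records and write
            # Parquet; one JSON object per record keeps the sketch short.)
            image = record["dynamodb"]["NewImage"]
            item = {k: deserializer.deserialize(v) for k, v in image.items()}
            s3.put_object(
                Bucket="my-athena-bucket",
                Key="incremental/{}.json".format(record["eventID"]),
                Body=json.dumps(item, default=str),
            )
        else:
            # MODIFY or REMOVE (a PutItem on an existing key shows up as
            # MODIFY): log the key, the event type and the new image so the
            # consolidation job can reconcile them later.
            keys = {k: deserializer.deserialize(v)
                    for k, v in record["dynamodb"]["Keys"].items()}
            new_image = {k: deserializer.deserialize(v)
                         for k, v in record["dynamodb"].get("NewImage", {}).items()}
            change_log.put_item(Item={
                "pk": json.dumps(keys, default=str),
                "sequenceNumber": record["dynamodb"]["SequenceNumber"],
                "eventName": record["eventName"],
                "newImage": json.dumps(new_image, default=str),
            })
```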
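
And a hedged sketch of the consolidation job, written as plain PySpark so it could run as a Glue Spark job (or on EMR). Bucket names, paths, the log table "stream-change-log" and the single string key attribute "id" are placeholder assumptions; pagination of the log scan and the final swap of the output prefix are omitted for brevity:

```python
# Hedged sketch of the consolidation job (plain PySpark, runnable as a Glue
# Spark job). All names and paths are placeholders.
import json

import boto3
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("consolidate-dynamodb-dump").getOrCreate()

# 1. The many small JSON files written by the Lambda, plus the previous
#    consolidated snapshot (assumed to exist from an earlier run).
increments = spark.read.json("s3://my-athena-bucket/incremental/")
snapshot = spark.read.parquet("s3://my-athena-bucket/consolidated/")

# 2. Pull the update/delete log and keep only the latest event per key.
log_items = boto3.resource("dynamodb").Table("stream-change-log").scan()["Items"]
latest = {}
for entry in sorted(log_items, key=lambda e: int(e["sequenceNumber"])):
    latest[entry["pk"]] = entry

changed_keys = [json.loads(pk)["id"] for pk in latest]
updated_rows = [json.loads(e["newImage"]) for e in latest.values()
                if e["eventName"] != "REMOVE"]

# 3. Drop stale/deleted versions of changed items, then add back the latest
#    version of every updated item.
merged = snapshot.unionByName(increments, allowMissingColumns=True)
if changed_keys:
    merged = merged.filter(~F.col("id").isin(changed_keys))
if updated_rows:
    merged = merged.unionByName(spark.createDataFrame(updated_rows),
                                allowMissingColumns=True)

# 4. Write fewer, bigger Parquet files for Athena to scan.
merged.coalesce(8).write.mode("overwrite").parquet(
    "s3://my-athena-bucket/consolidated-new/")
```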

This takes considerably more effort than using EMR to dump the full table hourly: judge for yourself whether it is worth it. :)

answered Oct 16 '22 20:10 by tom_139