Copying DynamoDB table to another DynamoDB table with transforms

I have two DynamoDB tables: Table_1 and Table_2. I am trying to deprecate Table_1 and copy its data into Table_2, which has different GSIs and different LSIs.

Table_1 attributes are: Id, state, isReused, empty, normal

Table_2 attributes are: UserId, Status, isOld, normal

Id maps to UserId, state maps to Status, normal maps to normal, empty is dropped from Table_2, and if state is "OLD" then isOld is set to true.
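For reference, the mapping above can be written as a small transform function. This is a sketch using plain maps rather than DynamoDB `AttributeValue` objects, just to make the intended logic concrete (the class and method names here are illustrative, not from any SDK):

```java
import java.util.HashMap;
import java.util.Map;

public class ItemTransform {

    // Maps a Table_1 item to a Table_2 item per the rules above.
    public static Map<String, Object> transform(Map<String, Object> table1Item) {
        Map<String, Object> table2Item = new HashMap<>();
        table2Item.put("UserId", table1Item.get("Id"));     // Id -> UserId
        table2Item.put("Status", table1Item.get("state"));  // state -> Status
        table2Item.put("normal", table1Item.get("normal")); // carried over as-is
        // "empty" is intentionally dropped.
        table2Item.put("isOld", "OLD".equals(table1Item.get("state")));
        return table2Item;
    }

    public static void main(String[] args) {
        Map<String, Object> item = new HashMap<>();
        item.put("Id", "user-1");
        item.put("state", "OLD");
        item.put("empty", "");
        item.put("normal", 42);
        System.out.println(transform(item));
    }
}
```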

What is the best way to export this data from Table_1, do the transform on the attributes/data, and then load the information back into Table_2?

Currently, I am able to use AWS Data Pipeline with the provided templates to export data from Table_1 and import it into Table_2, but the templates do not perform the transforms. I'm guessing that I need to use EMR to do them.

I also use DynamoDB Streams to keep the tables in sync, but from my understanding, a stream only captures items that are created or updated after it is enabled, not items that already exist in the table.

asked Mar 12 '23 by nat

2 Answers

Assuming that you need this data movement only once, I can think of two options:

  • Spawn an EMR cluster with Hive installed. Create two external Hive tables, one backed by each of Table_1 and Table_2. Then run something like 'insert into Table_2 select ... from Table_1'. This gives you full control over the transformations you need.
  • It is true that only new and updated items are sent to a DynamoDB stream. If you already have a syncer set up, why not 'update' every existing item in Table_1 once? Just add a dummy field to each of them, so that they are pushed to Table_1's stream, after which they'll be automatically copied to Table_2. Just make sure that your syncer ignores this dummy field. Advantages of this approach:
    • Will handle race conditions well.
    • Based on your business logic, you can quickly check if an item in Table_1 has already been synced to Table_2. This gives you the freedom to copy only those items in Table_1 that haven't been updated in Table_2 yet.
answered May 05 '23 by ketan vijayvargiya


Instead of using Data Pipeline and writing EMR jobs, you can write a script that scans all the items in Table_1 and does the transform in Java. After doing the transform, do a conditional put [1] so that an item is written to Table_2 only if it doesn't already exist there. This makes sure that any changes made to Table_1 during the backfill still show the latest information in Table_2.

(http://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_PutItem.html)
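With the real API this is a PutItem call whose ConditionExpression is "attribute_not_exists(UserId)". As an illustration only (not an SDK call), the semantics of that conditional put are equivalent to a put-if-absent: the backfill write succeeds only when no item with that key exists yet, so a newer value written by the live syncer is never overwritten.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory stand-in for Table_2, used only to show the
// conditional-put semantics described above.
public class ConditionalPutSketch {
    private final Map<String, String> table2 = new ConcurrentHashMap<>();

    // Returns true if the backfill write "won"; false if an item with
    // this key already exists (e.g. the syncer got there first).
    public boolean conditionalPut(String userId, String value) {
        return table2.putIfAbsent(userId, value) == null;
    }
}
```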

answered May 05 '23 by nat