It has been suggested in the Amazon docs (http://aws.amazon.com/dynamodb/), among other places, that you can back up your DynamoDB tables using Elastic MapReduce. I have a general understanding of how this could work, but I couldn't find any guides or tutorials on it. So my question is: how can I automate DynamoDB backups (using EMR)?
So far, I think I need to create a "streaming" job with a map function that reads the data from DynamoDB and a reduce function that writes it to S3, and I believe these could be written in Python (or Java or a few other languages).
Any comments, clarifications, code samples, or corrections are appreciated.
The DynamoDB Export to S3 feature is the easiest way to create backups that you can download locally or use with another AWS service. To customize the process of creating backups, you can use Amazon EMR, AWS Glue, or AWS Data Pipeline.
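For reference, a minimal Python sketch of starting such an export with boto3 (the table ARN, bucket, and prefix below are placeholders; the actual API call is shown commented out because it needs live credentials and PITR enabled on the table):

```python
def build_export_request(table_arn, bucket, prefix):
    """Build the parameters for dynamodb.export_table_to_point_in_time."""
    return {
        "TableArn": table_arn,
        "S3Bucket": bucket,
        "S3Prefix": prefix,
        "ExportFormat": "DYNAMODB_JSON",  # "ION" is the other supported format
    }

params = build_export_request(
    "arn:aws:dynamodb:us-east-1:123456789012:table/my_table",  # placeholder ARN
    "my-backup-bucket",
    "backups/my_table",
)
# With AWS credentials configured, the export is kicked off with:
# import boto3
# boto3.client("dynamodb").export_table_to_point_in_time(**params)
```

The export runs fully managed on the AWS side, so unlike the EMR approach there is no cluster to provision or tear down.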
With point-in-time recovery (PITR) enabled, DynamoDB maintains continuous, incremental backups of your table for the last 35 days, until you explicitly turn the feature off. You can enable PITR, or initiate backup and restore operations, with a single click in the AWS Management Console or a single API call.
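If it helps, enabling PITR from Python looks roughly like this (the table name is a placeholder, and the boto3 call is commented out since it requires credentials):

```python
def build_pitr_request(table_name, enabled=True):
    """Build the parameters for dynamodb.update_continuous_backups."""
    return {
        "TableName": table_name,
        "PointInTimeRecoverySpecification": {
            "PointInTimeRecoveryEnabled": enabled,
        },
    }

req = build_pitr_request("my_table")
# import boto3
# boto3.client("dynamodb").update_continuous_backups(**req)
```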
S3 is typically used for storing files like images and logs. DynamoDB is a NoSQL database that can be used as a key-value (schemaless record) store. For simple data storage, S3 is the cheapest service; DynamoDB offers better performance and higher scalability and availability.
How long will it take to export the DynamoDB table to S3? That depends on the amount of data in your tables; judging from our experiments, however, it takes at least 30 seconds even for the smallest tables.
With the introduction of AWS Data Pipeline, which includes a ready-made template for DynamoDB-to-S3 backups, the easiest way is to schedule a backup in Data Pipeline [link].
In case you have special needs (data transformation, very fine-grained control, ...), consider the answer by @greg.
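For a scripted setup, a Data Pipeline can also be created and activated from Python with boto3. This is only a sketch: the names are placeholders, and the actual calls (commented out) require AWS credentials plus a full pipeline definition, e.g. taken from the DynamoDB-to-S3 export template:

```python
def build_pipeline_request(name, unique_id):
    """Parameters for datapipeline.create_pipeline; uniqueId guards against
    accidentally creating duplicate pipelines on retries."""
    return {"name": name, "uniqueId": unique_id}

req = build_pipeline_request("dynamodb-backup", "dynamodb-backup-v1")
# import boto3
# client = boto3.client("datapipeline")
# pipeline_id = client.create_pipeline(**req)["pipelineId"]
# client.put_pipeline_definition(pipelineId=pipeline_id, pipelineObjects=[...])
# client.activate_pipeline(pipelineId=pipeline_id)
```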
There are some good guides for working with MapReduce and DynamoDB. I followed this one the other day and got data exporting to S3 reasonably painlessly. I think your best bet would be to create a Hive script that performs the backup task, save it in an S3 bucket, then use the AWS API for your language to programmatically spin up a new EMR job flow and complete the backup. You could set this up as a cron job.
Example of a Hive script exporting data from DynamoDB to S3:
CREATE EXTERNAL TABLE my_table_dynamodb (
    company_id string,
    id string,
    name string,
    city string,
    state string,
    postal_code string)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
    "dynamodb.table.name" = "my_table",
    "dynamodb.column.mapping" = "id:id,name:name,city:city,state:state,postal_code:postal_code");

CREATE EXTERNAL TABLE my_table_s3 (
    company_id string,
    id string,
    name string,
    city string,
    state string,
    postal_code string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://yourBucket/backup_path/dynamo/my_table';

INSERT OVERWRITE TABLE my_table_s3 SELECT * FROM my_table_dynamodb;
Here is an example of a PHP script that will spin up a new EMR job flow:
$emr = new AmazonEMR();
$response = $emr->run_job_flow(
    'My Test Job',
    array(
        "TerminationProtected" => "false",
        "HadoopVersion" => "0.20.205",
        "Ec2KeyName" => "my-key",
        "KeepJobFlowAliveWhenNoSteps" => "false",
        "InstanceGroups" => array(
            array(
                "Name" => "Master Instance Group",
                "Market" => "ON_DEMAND",
                "InstanceType" => "m1.small",
                "InstanceCount" => 1,
                "InstanceRole" => "MASTER",
            ),
            array(
                "Name" => "Core Instance Group",
                "Market" => "ON_DEMAND",
                "InstanceType" => "m1.small",
                "InstanceCount" => 1,
                "InstanceRole" => "CORE",
            ),
        ),
    ),
    array(
        "Name" => "My Test Job",
        "AmiVersion" => "latest",
        "Steps" => array(
            array(
                "HadoopJarStep" => array(
                    "Args" => array(
                        "s3://us-east-1.elasticmapreduce/libs/hive/hive-script",
                        "--base-path",
                        "s3://us-east-1.elasticmapreduce/libs/hive/",
                        "--install-hive",
                        "--hive-versions",
                        "0.7.1.3",
                    ),
                    "Jar" => "s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar",
                ),
                "Name" => "Setup Hive",
                "ActionOnFailure" => "TERMINATE_JOB_FLOW",
            ),
            array(
                "HadoopJarStep" => array(
                    "Args" => array(
                        "s3://us-east-1.elasticmapreduce/libs/hive/hive-script",
                        "--base-path",
                        "s3://us-east-1.elasticmapreduce/libs/hive/",
                        "--hive-versions",
                        "0.7.1.3",
                        "--run-hive-script",
                        "--args",
                        "-f", "s3n://myBucket/hive_scripts/hive_script.hql",
                        "-d", "INPUT=Var_Value1",
                        "-d", "LIB=Var_Value2",
                        "-d", "OUTPUT=Var_Value3",
                    ),
                    "Jar" => "s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar",
                ),
                "Name" => "Run Hive Script",
                "ActionOnFailure" => "CANCEL_AND_WAIT",
            ),
        ),
        "LogUri" => "s3n://myBucket/logs",
    )
);
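Since the PHP SDK version above is quite old, here is a rough Python (boto3) equivalent of the same idea, sketched against a modern EMR release. The bucket names, roles, and script path are placeholders, and the final call is commented out because it needs AWS credentials and an existing S3 bucket:

```python
# Assumed location of the Hive backup script from the answer above.
HIVE_SCRIPT = "s3://myBucket/hive_scripts/hive_script.hql"

job_flow = {
    "Name": "DynamoDB backup",
    "ReleaseLabel": "emr-6.15.0",          # modern releases bundle Hive directly
    "Applications": [{"Name": "Hive"}],    # so no "Setup Hive" step is needed
    "Instances": {
        "InstanceGroups": [
            {
                "Name": "Master Instance Group",
                "Market": "ON_DEMAND",
                "InstanceRole": "MASTER",
                "InstanceType": "m5.xlarge",
                "InstanceCount": 1,
            },
        ],
        "KeepJobFlowAliveWhenNoSteps": False,  # terminate after the backup step
    },
    "Steps": [
        {
            "Name": "Run Hive backup script",
            "ActionOnFailure": "CANCEL_AND_WAIT",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["hive-script", "--run-hive-script",
                         "--args", "-f", HIVE_SCRIPT],
            },
        },
    ],
    "LogUri": "s3://myBucket/logs",
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "ServiceRole": "EMR_DefaultRole",
}
# import boto3
# boto3.client("emr").run_job_flow(**job_flow)
```

As with the PHP version, wiring this into cron (or an EventBridge schedule) gives you the automated backup the question asks for.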