Backup AWS Dynamodb to S3

Tags:

amazon-s3

backup

amazon-dynamodb

elastic-map-reduce

The Amazon docs (http://aws.amazon.com/dynamodb/), among other places, suggest that you can back up your DynamoDB tables using Elastic MapReduce. I have a general understanding of how this could work, but I couldn't find any guides or tutorials on it.

So my question is: how can I automate DynamoDB backups (using EMR)?

So far, I think I need to create a "streaming" job with a map function that reads the data from DynamoDB and a reduce function that writes it to S3, and I believe these could be written in Python (or Java, or a few other languages).

Any comments, clarifications, code samples, or corrections are appreciated.

839

asked Nov 29 '12 16:11

Ali

2 Answers

With the introduction of AWS Data Pipeline, which has a ready-made template for DynamoDB-to-S3 backup, the easiest way is to schedule a backup in Data Pipeline [link].

If you have special needs (data transformation, very fine-grained control, ...), consider the answer by @greg.

167

answered Oct 07 '22 17:10

Ali

There are some good guides for working with MapReduce and DynamoDB. I followed this one the other day and got data exporting to S3 going reasonably painlessly. I think your best bet would be to create a Hive script that performs the backup task, save it in an S3 bucket, then use the AWS API for your language to programmatically spin up a new EMR job flow that runs the script and completes the backup. You could set this up as a cron job.

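Example of a Hive script exporting data from DynamoDB to S3:

    -- Hive table backed by the live DynamoDB table. Every Hive column is mapped
    -- to the DynamoDB attribute of the same name.
    CREATE EXTERNAL TABLE my_table_dynamodb (
        company_id string
        ,id string
        ,name string
        ,city string
        ,state string
        ,postal_code string)
    STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
    TBLPROPERTIES ("dynamodb.table.name" = "my_table",
        "dynamodb.column.mapping" = "company_id:company_id,id:id,name:name,city:city,state:state,postal_code:postal_code");

    -- Hive table backed by comma-delimited files in S3. It declares the same
    -- columns, in the same order, so that SELECT * lines up.
    CREATE EXTERNAL TABLE my_table_s3 (
        company_id string
        ,id string
        ,name string
        ,city string
        ,state string
        ,postal_code string)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION 's3://yourBucket/backup_path/dynamo/my_table';

    -- Copy every row from DynamoDB into the S3 location.
    INSERT OVERWRITE TABLE my_table_s3
    SELECT * FROM my_table_dynamodb;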
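If several tables need backing up, the Hive script could be generated rather than written by hand. The following is only a sketch, not part of the original answer: the function name `build_backup_script` is hypothetical, and it assumes every attribute is a string and that Hive column names match the DynamoDB attribute names.

```python
def build_backup_script(table, columns, s3_location):
    """Return a Hive script that copies `table` from DynamoDB to S3.

    Assumptions (not from the original answer): every column is a string,
    and each Hive column maps to a DynamoDB attribute of the same name.
    """
    # The same column list is used for both tables so SELECT * lines up.
    column_defs = ",\n    ".join(f"{name} string" for name in columns)
    # "hive_column:dynamodb_attribute" pairs, comma separated.
    mapping = ",".join(f"{name}:{name}" for name in columns)
    return f"""CREATE EXTERNAL TABLE {table}_dynamodb (
    {column_defs})
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "{table}",
               "dynamodb.column.mapping" = "{mapping}");

CREATE EXTERNAL TABLE {table}_s3 (
    {column_defs})
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '{s3_location}';

INSERT OVERWRITE TABLE {table}_s3
SELECT * FROM {table}_dynamodb;
"""
```

Calling `build_backup_script("my_table", ["id", "name"], "s3://yourBucket/backup_path/dynamo/my_table")` returns the script text, which you would then upload to the S3 path that the EMR step reads from.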

Here is an example of a PHP script that will spin up a new EMR job flow:

    // Assumes the AWS SDK for PHP (1.x, which provides the AmazonEMR class)
    // is already loaded and configured with credentials.
    $emr = new AmazonEMR();

    $response = $emr->run_job_flow(
        'My Test Job',
        // Cluster definition: one master and one core node.
        array(
            "TerminationProtected" => "false",
            "HadoopVersion" => "0.20.205",
            "Ec2KeyName" => "my-key",
            // Shut the cluster down once all steps have finished.
            "KeepJobFlowAliveWhenNoSteps" => "false",
            "InstanceGroups" => array(
                array(
                    "Name" => "Master Instance Group",
                    "Market" => "ON_DEMAND",
                    "InstanceType" => "m1.small",
                    "InstanceCount" => 1,
                    "InstanceRole" => "MASTER",
                ),
                array(
                    "Name" => "Core Instance Group",
                    "Market" => "ON_DEMAND",
                    "InstanceType" => "m1.small",
                    "InstanceCount" => 1,
                    "InstanceRole" => "CORE",
                ),
            ),
        ),
        array(
            "Name" => "My Test Job",
            "AmiVersion" => "latest",
            "Steps" => array(
                // Step 1: install Hive on the cluster.
                array(
                    "HadoopJarStep" => array(
                        "Args" => array(
                            "s3://us-east-1.elasticmapreduce/libs/hive/hive-script",
                            "--base-path",
                            "s3://us-east-1.elasticmapreduce/libs/hive/",
                            "--install-hive",
                            "--hive-versions",
                            "0.7.1.3",
                        ),
                        "Jar" => "s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar",
                    ),
                    "Name" => "Setup Hive",
                    "ActionOnFailure" => "TERMINATE_JOB_FLOW",
                ),
                // Step 2: run the backup script stored in S3.
                array(
                    "HadoopJarStep" => array(
                        "Args" => array(
                            "s3://us-east-1.elasticmapreduce/libs/hive/hive-script",
                            "--base-path",
                            "s3://us-east-1.elasticmapreduce/libs/hive/",
                            "--hive-versions",
                            "0.7.1.3",
                            "--run-hive-script",
                            "--args",
                            "-f",
                            "s3n://myBucket/hive_scripts/hive_script.hql",
                            // Optional variables passed to the Hive script.
                            "-d",
                            "INPUT=Var_Value1",
                            "-d",
                            "LIB=Var_Value2",
                            "-d",
                            "OUTPUT=Var_Value3",
                        ),
                        "Jar" => "s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar",
                    ),
                    "Name" => "Run Hive Script",
                    "ActionOnFailure" => "CANCEL_AND_WAIT",
                ),
            ),
            "LogUri" => "s3n://myBucket/logs",
        )
    );

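For readers who prefer Python (which the question mentions), roughly the same job flow could be launched with boto3. This is a hedged sketch and not part of the original answer: it assumes a recent EMR release where Hive is requested as an application and scripts run through `command-runner.jar`, and the release label, instance types, and the default role names `EMR_EC2_DefaultRole` and `EMR_DefaultRole` are assumptions to adapt to your account. The function names are hypothetical.

```python
def hive_backup_step(script_uri):
    """Describe an EMR step that runs the Hive script stored at `script_uri`."""
    return {
        "Name": "Run Hive Script",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            # command-runner.jar is the step runner on EMR 4.x and later.
            "Jar": "command-runner.jar",
            "Args": ["hive-script", "--run-hive-script",
                     "--args", "-f", script_uri],
        },
    }


def launch_backup(script_uri, log_uri, release_label="emr-6.15.0"):
    """Start a transient cluster that runs the backup and then terminates.

    The release label, instance types and IAM role names are assumptions.
    """
    import boto3  # imported lazily so the step builder works without boto3

    # Region and credentials come from the usual boto3 configuration.
    emr = boto3.client("emr")
    response = emr.run_job_flow(
        Name="DynamoDB backup",
        ReleaseLabel=release_label,
        Applications=[{"Name": "Hive"}],
        LogUri=log_uri,
        Instances={
            "InstanceGroups": [
                {"Name": "Master Instance Group", "Market": "ON_DEMAND",
                 "InstanceRole": "MASTER", "InstanceType": "m5.xlarge",
                 "InstanceCount": 1},
                {"Name": "Core Instance Group", "Market": "ON_DEMAND",
                 "InstanceRole": "CORE", "InstanceType": "m5.xlarge",
                 "InstanceCount": 1},
            ],
            # Shut the cluster down once the step has finished.
            "KeepJobFlowAliveWhenNoSteps": False,
            "TerminationProtected": False,
        },
        Steps=[hive_backup_step(script_uri)],
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    return response["JobFlowId"]
```

Calling `launch_backup("s3://myBucket/hive_scripts/hive_script.hql", "s3://myBucket/logs")` from a small script and scheduling that script with cron would give the automated backup described above.
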
answered Oct 07 '22 15:10

greg
