 

How to use Data Pipeline to export a DynamoDB table that uses on-demand capacity mode

I used to use the Data Pipeline template called Export DynamoDB table to S3 to export a DynamoDB table to a file. I recently switched all of my DynamoDB tables to on-demand capacity mode, and the template no longer works. I'm pretty certain this is because the old template specifies a percentage of DynamoDB throughput to consume, which is not relevant to on-demand tables.

I tried exporting the old template to JSON, removing the reference to throughput percentage consumption, and creating a new pipeline. However, this was unsuccessful.

Can anyone suggest how to convert an old style pipeline script with throughput provision to a new on-demand table script?

Here is my original functioning script:

{
  "objects": [
    {
      "name": "DDBSourceTable",
      "id": "DDBSourceTable",
      "type": "DynamoDBDataNode",
      "tableName": "#{myDDBTableName}"
    },
    {
      "name": "EmrClusterForBackup",
      "coreInstanceCount": "1",
      "coreInstanceType": "m3.xlarge",
      "releaseLabel": "emr-5.13.0",
      "masterInstanceType": "m3.xlarge",
      "id": "EmrClusterForBackup",
      "region": "#{myDDBRegion}",
      "type": "EmrCluster"
    },
    {
      "failureAndRerunMode": "CASCADE",
      "resourceRole": "DataPipelineDefaultResourceRole",
      "role": "DataPipelineDefaultRole",
      "scheduleType": "ONDEMAND",
      "name": "Default",
      "id": "Default"
    },
    {
      "output": {
        "ref": "S3BackupLocation"
      },
      "input": {
        "ref": "DDBSourceTable"
      },
      "maximumRetries": "2",
      "name": "TableBackupActivity",
      "step": "s3://dynamodb-emr-#{myDDBRegion}/emr-ddb-storage-handler/2.1.0/emr-ddb-2.1.0.jar,org.apache.hadoop.dynamodb.tools.DynamoDbExport,#{output.directoryPath},#{input.tableName},#{input.readThroughputPercent}",
      "id": "TableBackupActivity",
      "runsOn": {
        "ref": "EmrClusterForBackup"
      },
      "type": "EmrActivity",
      "resizeClusterBeforeRunning": "true"
    },
    {
      "directoryPath": "#{myOutputS3Loc}/#{format(@scheduledStartTime, 'YYYY-MM-dd-HH-mm-ss')}",
      "name": "S3BackupLocation",
      "id": "S3BackupLocation",
      "type": "S3DataNode"
    }
  ],
  "parameters": [
    {
      "description": "Output S3 folder",
      "id": "myOutputS3Loc",
      "type": "AWS::S3::ObjectKey"
    },
    {
      "description": "Source DynamoDB table name",
      "id": "myDDBTableName",
      "type": "String"
    },
    {
      "default": "0.25",
      "watermark": "Enter value between 0.1-1.0",
      "description": "DynamoDB read throughput ratio",
      "id": "myDDBReadThroughputRatio",
      "type": "Double"
    },
    {
      "default": "us-east-1",
      "watermark": "us-east-1",
      "description": "Region of the DynamoDB table",
      "id": "myDDBRegion",
      "type": "String"
    }
  ],
  "values": {
    "myDDBRegion": "us-east-1",
    "myDDBTableName": "LIVE_Invoices",
    "myDDBReadThroughputRatio": "0.25",
    "myOutputS3Loc": "s3://company-live-extracts/"
  }
}

Here is my attempted update that failed:

{
  "objects": [
    {
      "name": "DDBSourceTable",
      "id": "DDBSourceTable",
      "type": "DynamoDBDataNode",
      "tableName": "#{myDDBTableName}"
    },
    {
      "name": "EmrClusterForBackup",
      "coreInstanceCount": "1",
      "coreInstanceType": "m3.xlarge",
      "releaseLabel": "emr-5.13.0",
      "masterInstanceType": "m3.xlarge",
      "id": "EmrClusterForBackup",
      "region": "#{myDDBRegion}",
      "type": "EmrCluster"
    },
    {
      "failureAndRerunMode": "CASCADE",
      "resourceRole": "DataPipelineDefaultResourceRole",
      "role": "DataPipelineDefaultRole",
      "scheduleType": "ONDEMAND",
      "name": "Default",
      "id": "Default"
    },
    {
      "output": {
        "ref": "S3BackupLocation"
      },
      "input": {
        "ref": "DDBSourceTable"
      },
      "maximumRetries": "2",
      "name": "TableBackupActivity",
      "step": "s3://dynamodb-emr-#{myDDBRegion}/emr-ddb-storage-handler/2.1.0/emr-ddb-2.1.0.jar,org.apache.hadoop.dynamodb.tools.DynamoDbExport,#{output.directoryPath},#{input.tableName}",
      "id": "TableBackupActivity",
      "runsOn": {
        "ref": "EmrClusterForBackup"
      },
      "type": "EmrActivity",
      "resizeClusterBeforeRunning": "true"
    },
    {
      "directoryPath": "#{myOutputS3Loc}/#{format(@scheduledStartTime, 'YYYY-MM-dd-HH-mm-ss')}",
      "name": "S3BackupLocation",
      "id": "S3BackupLocation",
      "type": "S3DataNode"
    }
  ],
  "parameters": [
    {
      "description": "Output S3 folder",
      "id": "myOutputS3Loc",
      "type": "AWS::S3::ObjectKey"
    },
    {
      "description": "Source DynamoDB table name",
      "id": "myDDBTableName",
      "type": "String"
    },
    {
      "default": "us-east-1",
      "watermark": "us-east-1",
      "description": "Region of the DynamoDB table",
      "id": "myDDBRegion",
      "type": "String"
    }
  ],
  "values": {
    "myDDBRegion": "us-east-1",
    "myDDBTableName": "LIVE_Invoices",
    "myOutputS3Loc": "s3://company-live-extracts/"
  }
}

And here is the error (a truncated stack trace) from the Data Pipeline execution:

at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:322)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:198)
at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1341)
at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1338)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1338)
at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:575)
at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:570)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:570)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java
asked Feb 13 '19 by F_SO_K


1 Answer

I opened a support ticket with AWS on this. Their response was pretty comprehensive, so I will paste it below:


Thanks for reaching out regarding this issue.

Unfortunately, Data Pipeline export/import jobs for DynamoDB do not support DynamoDB's new On-Demand mode [1].

Tables using On-Demand capacity do not have defined capacities for Read and Write units. Data Pipeline relies on this defined capacity when calculating the throughput of the pipeline.

For example, if you have 100 RCU (Read Capacity Units) and a pipeline throughput of 0.25 (25%), the effective pipeline throughput would be 25 read units per second (100 * 0.25). However, in the case of On-Demand capacity, the RCU and WCU (Write Capacity Units) are reflected as 0. Regardless of the pipeline throughput value, the calculated effective throughput is 0.

The pipeline will not execute when the effective throughput is less than 1.
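A quick way to see this for your own table is to ask DescribeTable for the provisioned throughput values; for a table in On-Demand mode the reported read and write capacity units are 0, which is exactly the number the pipeline multiplies by the throughput ratio. The command below is just an illustration using the table name from the question:

$ aws dynamodb describe-table --table-name LIVE_Invoices \
    --query 'Table.{Billing:BillingModeSummary.BillingMode,RCU:ProvisionedThroughput.ReadCapacityUnits,WCU:ProvisionedThroughput.WriteCapacityUnits}'

# For an On-Demand table this should report PAY_PER_REQUEST with RCU and WCU of 0,
# which is why the pipeline's calculated effective throughput comes out below 1.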

Are you required to export DynamoDB tables to S3?

If you are using these table exports for backup purposes only, I recommend using DynamoDB's On-Demand Backup and Restore feature (a confusingly similar name to On-Demand capacity) [2].

Note that On-Demand Backups do not impact the throughput of your table, and are completed in seconds. You only pay for the storage costs associated with the backups. However, these table backups are not directly accessible to customers, and can only be restored into a new DynamoDB table. This method of backup is not suitable if you wish to perform analytics on the backup data, or import the data into other systems, accounts or tables.
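If a backup is all you need, creating one is a single CLI call. As a rough sketch (the backup name here is purely illustrative):

$ aws dynamodb create-backup --table-name LIVE_Invoices --backup-name LIVE_Invoices-manual-backup
$ aws dynamodb list-backups --table-name LIVE_Invoices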

If you need to use Data Pipeline to export DynamoDB data, then the only way forward is to set the table(s) to Provisioned capacity mode.

You could do this manually, or include it as an activity in the pipeline itself, using an AWS CLI command [3].

For example, to switch a table from On-Demand (also referred to as Pay Per Request mode) to Provisioned capacity:

$ aws dynamodb update-table --table-name myTable --billing-mode PROVISIONED --provisioned-throughput ReadCapacityUnits=100,WriteCapacityUnits=100

And to switch it back to On-Demand afterwards:

$ aws dynamodb update-table --table-name myTable --billing-mode PAY_PER_REQUEST

Note that after disabling On-Demand capacity mode, you need to wait for 24 hours before you can enable it again.
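Putting the pieces together, a rough manual sequence might look like the sketch below (the pipeline ID is a placeholder, and the same commands could equally be run from an activity inside the pipeline itself):

# Switch the table to Provisioned capacity so the export can calculate a throughput
$ aws dynamodb update-table --table-name LIVE_Invoices --billing-mode PROVISIONED \
    --provisioned-throughput ReadCapacityUnits=100,WriteCapacityUnits=100

# Wait until the table is ACTIVE again before starting the export
$ aws dynamodb wait table-exists --table-name LIVE_Invoices

# Kick off the export pipeline (placeholder pipeline ID)
$ aws datapipeline activate-pipeline --pipeline-id df-XXXXXXXXXXXX

# Switch back to On-Demand only after the export has finished, keeping in mind
# the 24-hour restriction on re-enabling PAY_PER_REQUEST mentioned above
$ aws dynamodb update-table --table-name LIVE_Invoices --billing-mode PAY_PER_REQUEST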

=== Reference Links ===

[1] DynamoDB On-Demand capacity (also refer to the note on unsupported services/tools): https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.ReadWriteCapacityMode.html#HowItWorks.OnDemand

[2] DynamoDB On-Demand Backup and Restore: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/BackupRestore.html

[3] AWS CLI reference for DynamoDB "update-table": https://docs.aws.amazon.com/cli/latest/reference/dynamodb/update-table.html

answered by F_SO_K