 

How to use Data Pipeline to export a DynamoDB table that uses on-demand capacity mode

I used to use the Data Pipeline template called Export DynamoDB table to S3 to export a DynamoDB table to a file. I recently switched all of my DynamoDB tables to on-demand capacity mode, and the template no longer works. I'm pretty certain this is because the old template specifies a percentage of DynamoDB throughput to consume, which is not relevant to on-demand tables.

I tried exporting the old template to JSON, removing the reference to throughput percentage consumption, and creating a new pipeline. However, this was unsuccessful.

Can anyone suggest how to convert an old style pipeline script with throughput provision to a new on-demand table script?

Here is my original functioning script:

{
  "objects": [
    {
      "name": "DDBSourceTable",
      "id": "DDBSourceTable",
      "type": "DynamoDBDataNode",
      "tableName": "#{myDDBTableName}"
    },
    {
      "name": "EmrClusterForBackup",
      "coreInstanceCount": "1",
      "coreInstanceType": "m3.xlarge",
      "releaseLabel": "emr-5.13.0",
      "masterInstanceType": "m3.xlarge",
      "id": "EmrClusterForBackup",
      "region": "#{myDDBRegion}",
      "type": "EmrCluster"
    },
    {
      "failureAndRerunMode": "CASCADE",
      "resourceRole": "DataPipelineDefaultResourceRole",
      "role": "DataPipelineDefaultRole",
      "scheduleType": "ONDEMAND",
      "name": "Default",
      "id": "Default"
    },
    {
      "output": {
        "ref": "S3BackupLocation"
      },
      "input": {
        "ref": "DDBSourceTable"
      },
      "maximumRetries": "2",
      "name": "TableBackupActivity",
      "step": "s3://dynamodb-emr-#{myDDBRegion}/emr-ddb-storage-handler/2.1.0/emr-ddb-2.1.0.jar,org.apache.hadoop.dynamodb.tools.DynamoDbExport,#{output.directoryPath},#{input.tableName},#{input.readThroughputPercent}",
      "id": "TableBackupActivity",
      "runsOn": {
        "ref": "EmrClusterForBackup"
      },
      "type": "EmrActivity",
      "resizeClusterBeforeRunning": "true"
    },
    {
      "directoryPath": "#{myOutputS3Loc}/#{format(@scheduledStartTime, 'YYYY-MM-dd-HH-mm-ss')}",
      "name": "S3BackupLocation",
      "id": "S3BackupLocation",
      "type": "S3DataNode"
    }
  ],
  "parameters": [
    {
      "description": "Output S3 folder",
      "id": "myOutputS3Loc",
      "type": "AWS::S3::ObjectKey"
    },
    {
      "description": "Source DynamoDB table name",
      "id": "myDDBTableName",
      "type": "String"
    },
    {
      "default": "0.25",
      "watermark": "Enter value between 0.1-1.0",
      "description": "DynamoDB read throughput ratio",
      "id": "myDDBReadThroughputRatio",
      "type": "Double"
    },
    {
      "default": "us-east-1",
      "watermark": "us-east-1",
      "description": "Region of the DynamoDB table",
      "id": "myDDBRegion",
      "type": "String"
    }
  ],
  "values": {
    "myDDBRegion": "us-east-1",
    "myDDBTableName": "LIVE_Invoices",
    "myDDBReadThroughputRatio": "0.25",
    "myOutputS3Loc": "s3://company-live-extracts/"
  }
}

Here is my attempted update that failed:

{
  "objects": [
    {
      "name": "DDBSourceTable",
      "id": "DDBSourceTable",
      "type": "DynamoDBDataNode",
      "tableName": "#{myDDBTableName}"
    },
    {
      "name": "EmrClusterForBackup",
      "coreInstanceCount": "1",
      "coreInstanceType": "m3.xlarge",
      "releaseLabel": "emr-5.13.0",
      "masterInstanceType": "m3.xlarge",
      "id": "EmrClusterForBackup",
      "region": "#{myDDBRegion}",
      "type": "EmrCluster"
    },
    {
      "failureAndRerunMode": "CASCADE",
      "resourceRole": "DataPipelineDefaultResourceRole",
      "role": "DataPipelineDefaultRole",
      "scheduleType": "ONDEMAND",
      "name": "Default",
      "id": "Default"
    },
    {
      "output": {
        "ref": "S3BackupLocation"
      },
      "input": {
        "ref": "DDBSourceTable"
      },
      "maximumRetries": "2",
      "name": "TableBackupActivity",
      "step": "s3://dynamodb-emr-#{myDDBRegion}/emr-ddb-storage-handler/2.1.0/emr-ddb-2.1.0.jar,org.apache.hadoop.dynamodb.tools.DynamoDbExport,#{output.directoryPath},#{input.tableName}",
      "id": "TableBackupActivity",
      "runsOn": {
        "ref": "EmrClusterForBackup"
      },
      "type": "EmrActivity",
      "resizeClusterBeforeRunning": "true"
    },
    {
      "directoryPath": "#{myOutputS3Loc}/#{format(@scheduledStartTime, 'YYYY-MM-dd-HH-mm-ss')}",
      "name": "S3BackupLocation",
      "id": "S3BackupLocation",
      "type": "S3DataNode"
    }
  ],
  "parameters": [
    {
      "description": "Output S3 folder",
      "id": "myOutputS3Loc",
      "type": "AWS::S3::ObjectKey"
    },
    {
      "description": "Source DynamoDB table name",
      "id": "myDDBTableName",
      "type": "String"
    },
    {
      "default": "us-east-1",
      "watermark": "us-east-1",
      "description": "Region of the DynamoDB table",
      "id": "myDDBRegion",
      "type": "String"
    }
  ],
  "values": {
    "myDDBRegion": "us-east-1",
    "myDDBTableName": "LIVE_Invoices",
    "myOutputS3Loc": "s3://company-live-extracts/"
  }
}

And here is the error (a truncated stack trace) from the Data Pipeline execution:

at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:322)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:198)
at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1341)
at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1338)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1338)
at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:575)
at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:570)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:570)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java
asked Feb 13 '19 by F_SO_K


1 Answer

I opened a support ticket with AWS on this. Their response was pretty comprehensive, so I will paste it below:


Thanks for reaching out regarding this issue.

Unfortunately, Data Pipeline export/import jobs for DynamoDB do not support DynamoDB's new On-Demand mode [1].

Tables using On-Demand capacity do not have defined capacities for Read and Write units. Data Pipeline relies on this defined capacity when calculating the throughput of the pipeline.

For example, if you have 100 RCU (Read Capacity Units) and a pipeline throughput of 0.25 (25%), the effective pipeline throughput would be 25 read units per second (100 * 0.25). However, in the case of On-Demand capacity, the RCU and WCU (Write Capacity Units) are reflected as 0. Regardless of the pipeline throughput value, the calculated effective throughput is 0.

The pipeline will not execute when the effective throughput is less than 1.
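A quick way to see this for your own table is to ask DescribeTable for the provisioned throughput values; for a table in On-Demand mode the reported read and write capacity units are 0, which is exactly the number the pipeline multiplies by the throughput ratio. The command below is just an illustration using the table name from the question:

$ aws dynamodb describe-table --table-name LIVE_Invoices \
    --query 'Table.{Billing:BillingModeSummary.BillingMode,RCU:ProvisionedThroughput.ReadCapacityUnits,WCU:ProvisionedThroughput.WriteCapacityUnits}'

# For an On-Demand table this should report PAY_PER_REQUEST with RCU and WCU of 0,
# which is why the pipeline's calculated effective throughput comes out below 1.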

Are you required to export DynamoDB tables to S3?

If you are using these table exports for backup purposes only, I recommend using DynamoDB's On-Demand Backup and Restore feature (a confusingly similar name to On-Demand capacity) [2].

Note that On-Demand Backups do not impact the throughput of your table, and are completed in seconds. You only pay for the storage costs associated with the backups. However, these table backups are not directly accessible to customers, and can only be restored into a new DynamoDB table. This method of backup is not suitable if you wish to perform analytics on the backup data, or import the data into other systems, accounts or tables.
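If a backup is all you need, creating one is a single CLI call. As a rough sketch (the backup name here is purely illustrative):

$ aws dynamodb create-backup --table-name LIVE_Invoices --backup-name LIVE_Invoices-manual-backup
$ aws dynamodb list-backups --table-name LIVE_Invoices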

If you need to use Data Pipeline to export DynamoDB data, then the only way forward is to set the table(s) to Provisioned capacity mode.

You could do this manually, or include it as an activity in the pipeline itself, using an AWS CLI command [3].

For example, to switch a table from On-Demand (also referred to as Pay Per Request mode) to Provisioned capacity:

$ aws dynamodb update-table --table-name myTable --billing-mode PROVISIONED --provisioned-throughput ReadCapacityUnits=100,WriteCapacityUnits=100

And to switch it back to On-Demand afterwards:

$ aws dynamodb update-table --table-name myTable --billing-mode PAY_PER_REQUEST

Note that after disabling On-Demand capacity mode, you need to wait for 24 hours before you can enable it again.
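Putting the pieces together, a rough manual sequence might look like the sketch below (the pipeline ID is a placeholder, and the same commands could equally be run from an activity inside the pipeline itself):

# Switch the table to Provisioned capacity so the export can calculate a throughput
$ aws dynamodb update-table --table-name LIVE_Invoices --billing-mode PROVISIONED \
    --provisioned-throughput ReadCapacityUnits=100,WriteCapacityUnits=100

# Wait until the table is ACTIVE again before starting the export
$ aws dynamodb wait table-exists --table-name LIVE_Invoices

# Kick off the export pipeline (placeholder pipeline ID)
$ aws datapipeline activate-pipeline --pipeline-id df-XXXXXXXXXXXX

# Switch back to On-Demand only after the export has finished, keeping in mind
# the 24-hour restriction on re-enabling PAY_PER_REQUEST mentioned above
$ aws dynamodb update-table --table-name LIVE_Invoices --billing-mode PAY_PER_REQUEST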

=== Reference Links ===

[1] DynamoDB On-Demand capacity (also refer to the note on unsupported services/tools): https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.ReadWriteCapacityMode.html#HowItWorks.OnDemand

[2] DynamoDB On-Demand Backup and Restore: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/BackupRestore.html

[3] AWS CLI reference for DynamoDB "update-table": https://docs.aws.amazon.com/cli/latest/reference/dynamodb/update-table.html

answered by F_SO_K