Amazon Elastic MapReduce - mass insert from S3 to DynamoDB is incredibly slow

I need to perform an initial upload of roughly 130 million items (5+ GB in total) into a single DynamoDB table. After running into problems uploading them through the API from my application, I decided to try EMR instead.

Long story short, importing that very average (by EMR standards) amount of data takes ages even on the most powerful cluster, consuming hundreds of hours with very little progress: a 2 MB test chunk took about 20 minutes to process, and a 700 MB test file did not finish within 12 hours.

I have already contacted Amazon Premium Support, but so far they have only told me that "for some reason the DynamoDB import is slow".

I have tried the following instructions in my interactive Hive session:

-- Source table: reads the pipe-delimited files directly from S3
CREATE EXTERNAL TABLE test_medium (
  hash_key string,
  range_key bigint,
  field_1 string,
  field_2 string,
  field_3 string,
  field_4 bigint,
  field_5 bigint,
  field_6 string,
  field_7 bigint
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
LOCATION 's3://my-bucket/s3_import/'
;

-- Target table: backed by the DynamoDB table via the storage handler
CREATE EXTERNAL TABLE ddb_target (
  hash_key string,
  range_key bigint,
  field_1 bigint,
  field_2 bigint,
  field_3 bigint,
  field_4 bigint,
  field_5 bigint,
  field_6 string,
  field_7 bigint
)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
  "dynamodb.table.name" = "my_ddb_table",
  "dynamodb.column.mapping" = "hash_key:hash_key,range_key:range_key,field_1:field_1,field_2:field_2,field_3:field_3,field_4:field_4,field_5:field_5,field_6:field_6,field_7:field_7"
)
;  

-- The import itself: copy every row from the S3-backed table into DynamoDB
INSERT OVERWRITE TABLE ddb_target SELECT * FROM test_medium;

None of the various flags seems to have any visible effect. I have tried the following settings instead of the defaults:

SET dynamodb.throughput.write.percent = 1.0;
SET dynamodb.throughput.read.percent = 1.0;
SET dynamodb.endpoint = dynamodb.eu-west-1.amazonaws.com;
SET hive.base.inputformat = org.apache.hadoop.hive.ql.io.HiveInputFormat;
SET mapred.map.tasks = 100;
SET mapred.reduce.tasks = 20;
SET hive.exec.reducers.max = 100;
SET hive.exec.reducers.min = 50;

The same commands complete in seconds when run against an HDFS location instead of the DynamoDB target.

That seems like a simple task and a very basic use case, and I really wonder what I could be doing wrong here.

asked May 21 '12 by Yuriy

People also ask

Which is faster S3 or DynamoDB?

For relatively small items, especially those under 4 KB, DynamoDB completes individual operations faster than Amazon S3. DynamoDB can scale on demand, but S3 offers better scalability: under very high traffic volumes, DynamoDB can be temporarily overwhelmed.

How can I speed up DynamoDB?

You can increase your DynamoDB throughput severalfold by parallelizing reads and writes across multiple partitions, as sketched below. Also, use DynamoDB as an attribute store rather than as a document store: this not only reduces read/write costs but also considerably improves the performance of your operations.
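
To make the parallelization point concrete, here is a minimal sketch in Python with boto3 (a newer SDK than existed when the question was asked; the table name and region are taken from the question, while the worker count, chunk size, and item format are illustrative assumptions):

import boto3
from concurrent.futures import ThreadPoolExecutor

TABLE = "my_ddb_table"   # table name from the question
REGION = "eu-west-1"

def write_chunk(chunk):
    # boto3 sessions are not thread-safe, so each worker builds its own.
    table = boto3.session.Session().resource(
        "dynamodb", region_name=REGION).Table(TABLE)
    # batch_writer buffers puts and flushes them as 25-item
    # BatchWriteItem requests, retrying unprocessed items.
    with table.batch_writer() as batch:
        for item in chunk:
            batch.put_item(Item=item)

def parallel_load(items, workers=8, chunk_size=500):
    # Split the items into chunks and write them concurrently, spreading
    # the load across connections and, ideally, table partitions.
    chunks = [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(write_chunk, chunks))

Parallelism only helps up to the table's provisioned write capacity; beyond that, DynamoDB simply throttles the extra requests.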

How long does a DynamoDB export to S3 take?

The time required to export a whole table depends on how much data it holds. However, judging from our experiments, it takes at least 30 seconds, even for the smallest table.


1 Answer

Here is the answer I finally received from AWS support. I hope it helps someone in a similar situation:

EMR workers are currently implemented as single-threaded workers, where each worker writes items one by one (using PutItem, not BatchWriteItem). Therefore, each write consumes one write capacity unit (IOP).

This means that you are establishing a lot of connections, which decreases performance to some degree. If BatchWriteItem were used, you could commit up to 25 rows in a single operation, which would be less costly performance-wise (but the same price, if I understand it correctly). This is something we are aware of and will probably implement in EMR in the future, though we can't offer a timeline.
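
For comparison, the two write paths look roughly like this in Python with boto3 (a sketch only; EMR's internal writer is not Python, and the placeholder items below merely mirror the question's key schema):

import boto3

table = boto3.resource("dynamodb", region_name="eu-west-1").Table("my_ddb_table")
items = [{"hash_key": "abc", "range_key": 1}]  # placeholder rows

# What the EMR workers were doing: one PutItem request, i.e. one HTTP
# round trip and one write capacity unit, per row.
for item in items:
    table.put_item(Item=item)

# What BatchWriteItem allows: up to 25 puts per request. boto3's
# batch_writer buffers items and flushes them in 25-item batches,
# retrying any unprocessed items automatically.
with table.batch_writer() as batch:
    for item in items:
        batch.put_item(Item=item)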

As stated before, the main problem here is that your DynamoDB table is hitting its provisioned throughput, so try increasing it temporarily for the import and then feel free to decrease it to whatever level you need afterwards.
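
To put rough numbers on that: with single-item puts of under 1 KB each, every row costs one write capacity unit, so 130 million rows need 130,000,000 write units in total; at, say, 10,000 provisioned write units (an illustrative figure, not one from the thread), that is 13,000 seconds, about 3.6 hours, even at perfect utilization. Scaling capacity up before the import and back down afterwards can be scripted; a hedged boto3 sketch, with placeholder capacity values:

import boto3

client = boto3.client("dynamodb", region_name="eu-west-1")

def set_capacity(table_name, read_units, write_units):
    # UpdateTable changes the provisioned throughput; the table remains
    # available while it transitions back to ACTIVE.
    client.update_table(
        TableName=table_name,
        ProvisionedThroughput={
            "ReadCapacityUnits": read_units,
            "WriteCapacityUnits": write_units,
        },
    )
    # Block until the new capacity is active before starting the import.
    client.get_waiter("table_exists").wait(TableName=table_name)

set_capacity("my_ddb_table", 100, 10000)  # before the import
# ... run the EMR/Hive import ...
set_capacity("my_ddb_table", 100, 1000)   # back down afterwards

Note that DynamoDB limits how many times per day provisioned capacity can be decreased, so plan the scale-down accordingly.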

This may sound a bit convenient, but there was a problem with the alerts at the time you were doing this, which is why you never received one. The problem has since been fixed.

answered Nov 09 '22 by Yuriy