
boto dynamodb: is there a way to optimize batch writing?

I am indexing large amounts of data into DynamoDB and experimenting with batch writing to increase actual throughput (i.e. make indexing faster). Here's a block of code (this is the original source):

def do_batch_write(items, conn, table):
    batch_list = conn.new_batch_write_list()
    batch_list.add_batch(table, puts=items)
    while True:
        response = conn.batch_write_item(batch_list)
        unprocessed = response.get('UnprocessedItems', None)
        if not unprocessed:
            break
        # identify unprocessed items and retry batch writing

I am using boto version 2.8.0. I get an exception if items has more than 25 elements. Is there a way to increase this limit? I have also noticed that sometimes, even when items is shorter than 25 elements, not all of them are processed in a single try, and there does not seem to be any correlation between the original length of items and either how often this happens or how many elements are left unprocessed. Is there a way to avoid this and write everything in one try? The ultimate goal is to make processing faster, not just to avoid repeats, so sleeping for a long period of time between successive tries is not an option.

Thx

Asked Dec 21 '22 by I Z

2 Answers

From the documentation:

"The BatchWriteItem operation puts or deletes multiple items in one or more tables. A single call to BatchWriteItem can write up to 16 MB of data, which can comprise as many as 25 put or delete requests. Individual items to be written can be as large as 400 KB."

The reason some items are not processed is probably that you are exceeding the provisioned throughput of your table. Do you have other write operations being performed on the table at the same time? Have you tried increasing the write throughput on your table to see if more items are processed?
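When throttling is the cause, the UnprocessedItems map in the response has the same shape as the RequestItems you sent, so you can resend it directly. A sketch of that retry loop with a short, capped backoff (boto3 again for illustration; the function name and limits are placeholders):

    import time
    import boto3

    client = boto3.client('dynamodb')

    def batch_write_with_retry(table_name, requests, max_tries=5):
        # requests: up to 25 {'PutRequest': ...} / {'DeleteRequest': ...} dicts
        pending = {table_name: requests}
        for attempt in range(max_tries):
            response = client.batch_write_item(RequestItems=pending)
            pending = response.get('UnprocessedItems', {})
            if not pending:
                return
            # brief exponential backoff: 0.1s, 0.2s, 0.4s, ... capped at 1.6s
            time.sleep(min(0.1 * 2 ** attempt, 1.6))
        raise RuntimeError('unprocessed items remain after %d tries' % max_tries)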

I'm not aware of any way to increase the limit of 25 items per request, but you could try asking on the AWS Forums or through your support channel.

I think the best way to get maximum throughput is to increase the write capacity units as high as you can and to parallelize the batch write operations across several threads or processes.
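For example, since each BatchWriteItem call spends most of its time waiting on the network, a thread pool is an easy way to keep several batches in flight at once. An illustrative sketch (boto3 clients are safe to share across threads; the chunking and retry logic above are assumed):

    from concurrent.futures import ThreadPoolExecutor
    import boto3

    client = boto3.client('dynamodb')  # boto3 clients are thread-safe

    def write_chunk(table_name, chunk):
        requests = [{'PutRequest': {'Item': item}} for item in chunk]
        return client.batch_write_item(RequestItems={table_name: requests})

    def parallel_batch_put(table_name, chunks_of_25, workers=8):
        with ThreadPoolExecutor(max_workers=workers) as pool:
            futures = [pool.submit(write_chunk, table_name, c) for c in chunks_of_25]
            for f in futures:
                f.result()  # propagate any errors from the workers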

Answered Feb 01 '23 by garnaat

In my experience, there is little to be gained by trying to optimize your write throughput using either batch writes or multithreading. Batch writing saves a little network time, and multithreading saves close to nothing, since the item size limit is quite low and the bottleneck is very often DynamoDB throttling your requests.

So (like it or not) increasing your Write Capacity in DynamoDB is the way to go.
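If the table uses provisioned capacity, write capacity can be raised from the console or programmatically; a minimal boto3 sketch (the table name and unit values are placeholders):

    import boto3

    client = boto3.client('dynamodb')

    # raise provisioned throughput; both read and write units must be supplied
    client.update_table(
        TableName='my-table',  # placeholder
        ProvisionedThroughput={
            'ReadCapacityUnits': 10,
            'WriteCapacityUnits': 500,
        },
    )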

Also, as garnaat said, latency from inside the AWS region (around 15 ms) is often very different from inter-region or outside-AWS latency (more like 250 ms).

Answered Feb 01 '23 by oDDsKooL