I'm trying to iterate through all items in my DynamoDB table. (I understand this is an inefficient process, but it's a one-time job to build an index table.)
I understand that DynamoDB's scan() function returns at most 1MB of data or a supplied item limit, whichever is reached first. To compensate for this, I wrote a function that looks for the "LastEvaluatedKey" in the result and re-issues the scan starting from that key to get all the results.
Unfortunately, it seems like every time my function loops, every single key in the entire database is scanned, quickly eating through my provisioned read capacity units. It's extremely slow.
Here is my code:
def search(self, table, scan_filter=None, range_key=None,
           attributes_to_get=None,
           limit=None):
    """Scan a table for values and return the accumulated
    list of items.
    """
    start_key = None
    num_results = 0
    total_results = []
    loop_iterations = 0
    request_limit = limit
    while num_results < limit:
        results = self.conn.layer1.scan(table_name=table,
                                        attributes_to_get=attributes_to_get,
                                        exclusive_start_key=start_key,
                                        limit=request_limit)
        num_results = num_results + len(results['Items'])
        total_results = total_results + results['Items']
        loop_iterations = loop_iterations + 1
        request_limit = request_limit - results['Count']
        # 'LastEvaluatedKey' is absent from the final page of a scan,
        # so use .get() to avoid a KeyError when the table is exhausted.
        start_key = results.get('LastEvaluatedKey')
        print "Count: " + str(results['Count'])
        print "Scanned Count: " + str(results['ScannedCount'])
        if start_key is not None:
            print "Last Evaluated Key: " + str(start_key['HashKeyElement']['S'])
        print "Capacity: " + str(results['ConsumedCapacityUnits'])
        print "Loop Iterations: " + str(loop_iterations)
        if start_key is None:
            break  # scan exhausted before reaching the requested limit
    return total_results
Calling the function:
db = DB()
results = db.search(table='media',limit=500,attributes_to_get=['id'])
And my output:
Count: 96
Scanned Count: 96
Last Evaluated Key: kBR23QJNAwYZZxF4E3N1crQuaTwjIeFfjIv8NyimI9o
Capacity: 517.5
Loop Iterations: 1
Count: 109
Scanned Count: 109
Last Evaluated Key: ATcJFKfY62NIjTYY24Z95Bd7xgeA1PLXAw3gH0KvUjY
Capacity: 516.5
Loop Iterations: 2
Count: 104
Scanned Count: 104
Last Evaluated Key: Lm3nHyW1KMXtMXNtOSpAi654DSpdwV7dnzezAxApAJg
Capacity: 516.0
Loop Iterations: 3
Count: 104
Scanned Count: 104
Last Evaluated Key: iirRBTPv9xDcqUVOAbntrmYB0PDRmn5MCDxdA6Nlpds
Capacity: 513.0
Loop Iterations: 4
Count: 100
Scanned Count: 100
Last Evaluated Key: nBUc1LHlPPELGifGuTSqPNfBxF9umymKjCCp7A7XWXY
Capacity: 516.5
Loop Iterations: 5
Is this expected behavior? Or, what am I doing wrong?
You are not doing anything wrong.
This is closely related to the way Amazon computes the capacity unit. First, it is extremely important to understand that:
capacity units == reserved computational units
capacity units != reserved network transit
Well, even that is not strictly exact, but it is quite close, especially when it comes to Scan.
During a Scan operation, there is a fundamental distinction between:

- scanned items: everything the engine reads while walking the table, accumulating up to 1MB per request, and
- returned items: the subset actually sent back, which stops growing as soon as the 1MB limit or the supplied count limit is already reached.

As the capacity unit is a compute unit, you pay for the scanned items. Well, actually, you pay for the cumulated size of the scanned items. Beware that this size includes all the storage and index overhead... 0.5 capacity unit / cumulated KB.
The scanned size does not depend on any filter, be it a field selector or a result filter.
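To see this in practice, here is a minimal sketch (the 'type' attribute and its value are hypothetical, and conn stands for the same boto connection your code uses): the filter shrinks Count but leaves ScannedCount, and therefore the consumed capacity, unchanged.

# Hypothetical filter; the syntax follows the layer1 (DynamoDB v1) ScanFilter format.
unfiltered = conn.layer1.scan(table_name='media')
filtered = conn.layer1.scan(
    table_name='media',
    scan_filter={'type': {'AttributeValueList': [{'S': 'video'}],
                          'ComparisonOperator': 'EQ'}})

print unfiltered['ScannedCount'], filtered['ScannedCount']  # identical
print unfiltered['Count'], filtered['Count']                # Count drops with the filter
print unfiltered['ConsumedCapacityUnits'], \
      filtered['ConsumedCapacityUnits']                     # identical: you pay for the scan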
From your results, I guess that your items require ~10KB each, which your comment on their actual payload size tends to confirm.
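A quick back-of-the-envelope check against the first loop iteration supports this (numbers copied from your output, rate from the 0.5 unit per cumulated KB figure above):

consumed_capacity = 517.5   # ConsumedCapacityUnits, iteration 1
scanned_count = 96          # ScannedCount, iteration 1
capacity_per_kb = 0.5       # eventually consistent scan rate

scanned_kb = consumed_capacity / capacity_per_kb  # 1035.0 KB actually read
print scanned_kb / scanned_count                  # ~10.8 KB per item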
I have a test table which contains only very small elements. A Scan consumes only 1.0 capacity unit to retrieve 100 items because their cumulated size is < 2KB.
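The same arithmetic, run in the other direction, matches that observation:

cumulated_kb = 2.0        # 100 tiny items, cumulated size just under 2KB
print cumulated_kb * 0.5  # 1.0 capacity unit for the whole page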