Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Retrieve distinct values from the hash key - DynamoDB

I have a dynamodb table to store email attribute information. I have a hash key on the email, range key on timestamp(number). The initial idea for using email as hash key is to query all emails by per email. But one thing I trying to do is retrieve all email ids(in hash key). I am using boto for this, but I am unsure as to how to retrieve distinct email ids.

My current code to pull 10,000 email records is

conn=boto.dynamodb2.connect_to_region('us-west-2')
email_attributes = Table('email_attributes', connection=conn)
s = email_attributes.scan(limit=10000,attributes=['email']) 

But to retrieve the distinct records, I will have to do a full table scan and then pick the distinct records in the code. Another idea that I have is to maintain another table that will just store these emails and do conditional writes to see if an email id exists, if not then write. But I am trying to think if this will be more expensive and it will be a conditional write.

Q1.) Is there a way to retrieve distinct records using a DynamoDB scan?
Q2.) Is there a good way to calculate the cost per query?
like image 639
macha Avatar asked May 01 '15 17:05

macha


1 Answers

Using a DynamoDB Scan, you would need to filter out duplicates on the client side (in your case, using boto). Even if you create a GSI with the reverse schema, you will still get duplicates. Given a H+R table of email_id+timestamp called stamped_emails, a list of all unique email_ids is a materialized view of the H+R stamped_emails table. You could enable a DynamoDB Stream on the stamped_emails table, subscribe a Lambda function to stamped_emails' Stream that does a PutItem (email_id) to a Hash-only table called emails_only. Then, you could Scan emails_only and you would get no duplicates.

Finally, regarding your question about cost, Scan will read entire items even if you only request certain projected attributes from those items. Second, Scan has to read through every item, even if it is filtered out by a FilterExpression (Condition Expression). Third, Scan reads through items sequentially. That means that each scan call is treated as one big read for metering purposes. The cost implication of this is that if a Scan call reads 200 different items, it will not necessarily cost 100 RCU. If the size of each of those items is 100 bytes, that Scan call will cost ROUND_UP((20000 bytes / 1024 kb/byte) / 8 kb / EC RCU) = 3 RCU. Even if this call only returns 123 items, if the Scan had to read 200 items, you would incur 3 RCU in this situation.

like image 108
Alexander Patrikalakis Avatar answered Nov 14 '22 08:11

Alexander Patrikalakis