
searching on array items on a DynamoDB table

I need to understand how to search on a DynamoDB attribute that is stored as an array.

Say I am denormalising a table where a person has many email addresses. I would create an array in the person table to store the email addresses.

The email address is not part of the sort key, so if I need to search on an email address to find the person record, I need to index the email attribute.

  1. Can I create an index on the email address, which is in a 1-many relationship with a person record and, as I understand it, is stored as an array in DynamoDB?
  2. Would this secondary index be global or local? Assuming I have billions of person records?
    1. If I could create it as either LSI or GSI, please explain the pros/cons of each.

Thank you very much!

Bluetoba asked Apr 28 '18 23:04


2 Answers

It's worth getting the terminology right to start with. The data types DynamoDB supports are:

Scalar - String, number, binary, boolean

Document - List, Map

Sets - String Set, Number Set, Binary Set

I think you are suggesting you have an attribute that contains a list of emails. The attribute might look like this

Emails: ["[email protected]", "[email protected]", "[email protected]"]

There are a couple of relevant points about key attributes. Firstly, keys must be top-level attributes (they can't be nested in JSON documents). Secondly, they must be of scalar types (i.e. String, Number or Binary).

As your list of emails is not a scalar type, you cannot use it as a table key or as an index key.

Given this schema you would have to perform a scan, setting the FilterExpression on your Emails attribute using the contains function. Bear in mind that a scan reads every item in the table and applies the filter afterwards.
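As a rough sketch of that scan, assuming a table whose items carry an Emails list attribute as above: the function name, the table name and the `client` argument (something that behaves like `boto3.client("dynamodb")`) are my own placeholders, not anything from the question.

```python
def find_people_by_email(client, table_name, email):
    """Scan the whole table and keep items whose Emails list contains `email`.

    `client` is assumed to behave like boto3.client("dynamodb").
    """
    kwargs = {
        "TableName": table_name,
        # contains() is true when the Emails list has an element equal to :email.
        "FilterExpression": "contains(Emails, :email)",
        "ExpressionAttributeValues": {":email": {"S": email}},
    }
    items = []
    while True:
        page = client.scan(**kwargs)
        items.extend(page.get("Items", []))
        last_key = page.get("LastEvaluatedKey")
        if not last_key:
            return items
        # More pages remain: resume the scan where the previous page stopped.
        kwargs["ExclusiveStartKey"] = last_key
```

The filter only trims what is returned; every item is still read, so on billions of records this is slow and expensive.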

F_SO_K answered Sep 28 '22 10:09


Stu's answer has some great information in it and he is right: you can't use an Array itself as a key.

What you CAN sometimes do is concatenate several variables (or an Array) into a single string with a known separator (maybe '_' for example), and then use that string as a Sort Key.

I used this concept to create a composite Sort Key that consisted of multiple ISO 8601 dates (DynamoDB has no native date type, so dates are typically stored as ISO 8601 strings in String type attributes). I also used several attributes that were not dates but were integers with a fixed character length.

By using the BETWEEN comparison I am able to individually query each of the variables that are concatenated into the Sort Key, or construct a complex query that matches against all of them as a group.
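Here is a minimal sketch of that kind of composite Sort Key, assuming one ISO 8601 timestamp plus one fixed-width integer. The attribute names PK and SK, the '_' separator and both function names are placeholders I made up for illustration. The point is that zero-padding keeps string order in line with numeric order, which is what makes BETWEEN usable.

```python
from datetime import datetime, timezone

SEPARATOR = "_"  # assumed separator; it must not occur inside any component


def composite_sort_key(created_at: datetime, priority: int) -> str:
    """Join an ISO 8601 UTC timestamp and a zero-padded integer into one string."""
    stamp = created_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return f"{stamp}{SEPARATOR}{priority:05d}"


def between_query(table_name, partition_value, low, high):
    """Build keyword arguments for a low-level Query using BETWEEN on the sort key."""
    return {
        "TableName": table_name,
        # PK and SK are hypothetical key attribute names.
        "KeyConditionExpression": "PK = :pk AND SK BETWEEN :low AND :high",
        "ExpressionAttributeValues": {
            ":pk": {"S": partition_value},
            ":low": {"S": low},
            ":high": {"S": high},
        },
    }
```

With a real client this would be passed as `client.query(**between_query("Events", "Bob", "2018-04-28", "2018-04-29"))`, which matches every sort key that starts with the date 2018-04-28.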

In other words a data object could use a Sort Key like this: [email protected][email protected][email protected]

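Then you could query that (assuming you knew what the partition key is) with something like the pseudo-SQL below. Note that DynamoDB has no LIKE operator: a Sort Key condition can only test equality, a range (including BETWEEN) or a prefix with begins_with, so in practice only the first value in the concatenated string can be matched this way, not an arbitrary substring.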

SELECT * FROM Users WHERE User='Bob' AND Emails LIKE '%[email protected]%'

YOU MUST know the partition key in order to perform a Query no matter what you choose as your Sort Key and no matter how that Sort Key is constructed.
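A hedged sketch of what such a Query could look like with the low-level API, assuming a table keyed on User (partition) and a concatenated Emails string (sort). The function name is a placeholder, and the attribute names are aliased in case they collide with DynamoDB reserved words.

```python
def query_by_user_and_email_prefix(table_name, user, email_prefix):
    """Build keyword arguments for a Query on one partition with a sort key prefix."""
    return {
        "TableName": table_name,
        # The partition key must be an equality test; the sort key may use begins_with.
        "KeyConditionExpression": "#user = :user AND begins_with(#emails, :prefix)",
        "ExpressionAttributeNames": {"#user": "User", "#emails": "Emails"},
        "ExpressionAttributeValues": {
            ":user": {"S": user},
            ":prefix": {"S": email_prefix},
        },
    }
```

Only a prefix can be matched here, so this finds an address only when it is the first one in the concatenated string.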

I think the real question you are asking is what should my sort keys and partition keys be? That will depend on exactly which queries you want to make and how frequently each type of query is used.

I have found that I have way more success with DynamoDB if I think about the queries I want to make first, and then go from there.

A word on Secondary Indexes (GSI / LSI)

The issue here is that you still need to 'know' the Partition Key for your secondary data structure. GSI / LSI help you avoid needing to create additional DynamoDB tables for the sole purpose of improving data access.

From Amazon: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/SecondaryIndexes.html

To me it sounds more like the issue is selecting the Keys.

LSI (Local Secondary Index): If (for your Query case) you don't know the Partition Key to begin with (as it seems you don't), then a Local Secondary Index won't help, since it has the SAME Partition Key as the base table.

GSI (Global Secondary Index): A Global Secondary Index could help in that you can have a DIFFERENT Partition Key and Sort Key (presumably a partition key that you could 'know' for this query).

So you could use the Email attribute (perhaps composite) as the Sort Key on your GSI and then something like a service name, or sign-up stage, as your Partition Key. This would let you 'know' what partition that user would be in based on their progress or the service they signed up from (for example).

Unlike the base table's primary key, the key values of a GSI / LSI do not have to be unique, so a query on the index can return several items. Keep that in mind!
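To make the GSI idea a bit more concrete, here is a sketch under some loud assumptions: since an index key has to be a scalar, it assumes you write one item per email address (each with a single String Email attribute) rather than one list per person, and that the GSI is keyed on ServiceName (partition) and Email (sort). The index name ServiceEmailIndex, the attribute names and both functions are hypothetical.

```python
def email_items(person_id, service_name, emails):
    """Fan a person's email list out into one item per address (assumed layout)."""
    return [
        {
            "PersonId": {"S": person_id},
            "Email": {"S": email},  # scalar, so it can serve as an index key
            "ServiceName": {"S": service_name},
        }
        for email in emails
    ]


def query_email_index(table_name, service_name, email):
    """Build keyword arguments for a Query against the hypothetical GSI."""
    return {
        "TableName": table_name,
        "IndexName": "ServiceEmailIndex",  # hypothetical index name
        "KeyConditionExpression": "ServiceName = :service AND Email = :email",
        "ExpressionAttributeValues": {
            ":service": {"S": service_name},
            ":email": {"S": email},
        },
    }
```

The returned items carry PersonId, which can then be used to fetch the full person record from the base table.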

Necevil answered Sep 28 '22 11:09