
ARRAY_CONTAINS vs JOIN in Azure Cosmos DB

The JSON documents that we plan to ingest into DocumentDb look as follows…

[
{"id":"id1","LastName": "user1", "GroupMembership":["g1","g2"]},
{"id":"id2","LastName": "user2", "GroupMembership":["g1","g4","g5"]},
{"id":"id3","LastName": "user3", "GroupMembership":["g3","g4","g2"]},
…
]

We want to answer queries such as: get the count of all users who are members of group "g1" or "g2". The number of users is very large (a few million). What is the best way to implement this query so that it uses the index and avoids scans? Should I use ARRAY_CONTAINS or JOIN? Does ARRAY_CONTAINS use the index internally, or does it do a scan?

Option1)

SELECT VALUE COUNT(1) FROM Users WHERE ARRAY_CONTAINS(Users.GroupMembership, "g1") or ARRAY_CONTAINS(Users.GroupMembership, "g2")

Option2)

SELECT VALUE COUNT(1) FROM Users JOIN Membership in Users.GroupMembership WHERE Membership = "g1" or Membership = "g2"
durga prasad asked Sep 01 '25 02:09



2 Answers

Both queries should use the index in the same way, but ARRAY_CONTAINS is likely to execute faster than JOIN. You can profile both queries with Query Metrics, as described in this article: https://learn.microsoft.com/en-us/azure/cosmos-db/documentdb-sql-query-metrics#query-execution-metrics

Samer Boshra answered Sep 02 '25 16:09



Both should use the index in the same way. However, JOIN can return duplicate results per document, whereas ARRAY_CONTAINS cannot: the JOIN produces one row for each matching array element, so a user who belongs to both "g1" and "g2" is counted twice. I think that difference is very significant. For more on the duplication issue, see the replies to the SO questions "Getting duplicate records in select query for the Azure DocumentDB" and "Cosmos db joins give duplicate results".
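
To make the duplicate-counting point concrete, here is a minimal sketch in plain Python that only imitates the row semantics of the two queries on the three sample documents from the question. It does not talk to Cosmos DB; the function names `count_array_contains` and `count_join` are hypothetical helpers made up for this illustration.

```python
# Sample documents shaped like the ones in the question.
users = [
    {"id": "id1", "LastName": "user1", "GroupMembership": ["g1", "g2"]},
    {"id": "id2", "LastName": "user2", "GroupMembership": ["g1", "g4", "g5"]},
    {"id": "id3", "LastName": "user3", "GroupMembership": ["g3", "g4", "g2"]},
]
wanted = {"g1", "g2"}


def count_array_contains(docs, groups):
    """Imitates Option 1: at most one row per document, however many groups match."""
    return sum(1 for d in docs if any(g in d["GroupMembership"] for g in groups))


def count_join(docs, groups):
    """Imitates Option 2: one row per (document, array element) pair that matches."""
    return sum(1 for d in docs for m in d["GroupMembership"] if m in groups)


print(count_array_contains(users, wanted))  # 3 users
print(count_join(users, wanted))            # 4 rows: id1 matches on both g1 and g2
```

With this sample data the two counts differ by exactly the number of extra matches: user id1 is in both groups, so the JOIN-style count sees that document twice.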

Andriy Ivaneyko answered Sep 02 '25 17:09
