Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to set clusters and sharding in ArangoDB?

I want to use sharding in arangoDB.I have made coordinators, DBServers as mentioned in documentation 2.8.5. But still can someone still explain it in details and also how can I able to check the performance of my query after and before sharding.

like image 427
Haseb Ansari Avatar asked Oct 31 '22 07:10

Haseb Ansari


1 Answers

Testing your application can be done with a local cluster, were all instances run on one machine - which is what you already did, if I get that correctly?

An ArangoDB cluster consists of coordinator and dbserver nodes. Coordinators don't have own user specific local collections on disk. Their role is to handle the I/O with the clients, parse, optimize and distribute the queries and the user data to the dbserver nodes. Foxx services will also be run on the coordinators. DBServers are the storage nodes in this setup, they keep the user data.

To compare the performance between clustered and non clustered mode you import a dataset on a clustered instance and a non clustered one and compare the query result times. Since the cluster setup can have more network communication (i.e. if you do a join) than the single server case, the performance can be different. On a physically distributed cluster you may achieve higher throughput, since in the end the cluster nodes are own machines and have their own IO paths that end on separate physical harddisks.

In the cluster case you create collections specifying the number of shards using the numberOfShards parameter; the shardKeys parameter can control the distribution of your documents across the shards. You should choose that key so documents distribute well across the shards (i.e. are not inbalanced to just one shard). The numberOfShards can be an arbitrary value and doesn't have to corrospond to the number of dbserver nodes - it could even be bigger so you can more easily move a shard from one dbserver to a new dbserver when scaling up your cluster to more nodes in the future to adapt to higher loads.

When you're developping AQL queries with cluster use in mind, its essential to use the explain command to inspect how the query is distributed across the clusters, and where filters can be deployed:

db._create("sharded", {numberOfShards: 2})
db._explain("FOR x IN sharded RETURN x")
Query string:
 FOR x IN sharded RETURN x

Execution plan:
 Id   NodeType                  Est.   Comment
  1   SingletonNode                1   * ROOT
  2   EnumerateCollectionNode      1     - FOR x IN sharded /* full collection scan */
  6   RemoteNode                   1       - REMOTE
  7   GatherNode                   1       - GATHER
  3   ReturnNode                   1       - RETURN x

Indexes used:
 none

Optimization rules applied:
 Id   RuleName
  1   scatter-in-cluster
  2   remove-unnecessary-remote-scatter

In this simple query the RETURN & GATHER -nodes are on the coordinator; the nodes upwards including the REMOTE-node are deployed to the DB-server.

In general less REMOTE / SCATTER -> GATHER pairs means less cluster communication. The closer FILTER nodes can be deployed to *CollectionNodes to reduce the amount of the documents to be sent via the REMOTE-nodes the better the performance.

like image 150
dothebart Avatar answered Dec 05 '22 10:12

dothebart