I want to use sharding in ArangoDB. I have set up coordinators and DBServers as described in the 2.8.5 documentation. Could someone explain it in more detail, and also how I can compare the performance of my query before and after sharding?
Testing your application can be done with a local cluster, where all instances run on one machine, which is what you already did, if I understand correctly?
An ArangoDB cluster consists of coordinator and dbserver nodes. Coordinators don't keep user-specific collections of their own on local disk. Their role is to handle the I/O with the clients, and to parse, optimize and distribute the queries and the user data to the dbserver nodes. Foxx services also run on the coordinators. DBServers are the storage nodes in this setup; they keep the user data.
To compare the performance between clustered and non-clustered mode, import the same dataset into a clustered instance and a non-clustered one and compare the query execution times. Since the cluster setup can involve more network communication than the single-server case (e.g. if you do a join), the performance can differ. On a physically distributed cluster you may achieve higher throughput, since the cluster nodes are separate machines with their own I/O paths that end on separate physical hard disks.
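A rough way to get comparable numbers is to run the same query several times on each setup and compare the median. The sketch below is only an illustration: medianRuntimeMs, median and fakeQuery are hypothetical helper names, and fakeQuery is a stand-in workload so the snippet runs anywhere. In arangosh you would presumably pass a callback that wraps something like db._query("FOR x IN sharded RETURN x").toArray() instead.

```javascript
// Median of a list of numbers (input array is not modified).
function median(values) {
  var sorted = values.slice().sort(function (a, b) { return a - b; });
  var mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Run the callback `runs` times and return the median wall-clock time in ms.
// `runQuery` is a hypothetical callback that executes the query under test.
function medianRuntimeMs(runQuery, runs) {
  var timings = [];
  for (var i = 0; i < runs; i++) {
    var start = Date.now();
    runQuery();
    timings.push(Date.now() - start);
  }
  return median(timings);
}

// Stand-in workload (assumption): replace with a real query call in arangosh.
function fakeQuery() {
  var sum = 0;
  for (var i = 0; i < 100000; i++) { sum += i; }
  return sum;
}

var medianMs = medianRuntimeMs(fakeQuery, 5);
console.log("median runtime (ms): " + medianMs);
```

Running the same helper once against the single server and once against a coordinator gives two numbers measured the same way; the median is less sensitive to a cold first run than a single measurement.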
In the cluster case you create collections specifying the number of shards using the numberOfShards parameter; the shardKeys parameter controls the distribution of your documents across the shards. You should choose that key so that documents distribute well across the shards (i.e. are not imbalanced towards just one shard). The numberOfShards can be an arbitrary value and doesn't have to correspond to the number of dbserver nodes. It can even be bigger, so you can more easily move a shard from one dbserver to a new dbserver when scaling up your cluster to more nodes in the future to adapt to higher loads.
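To get a feeling for why the choice of shard key matters, here is a small simulation. It is a sketch under loud assumptions: it does not use ArangoDB's real hash function (FNV-1a is swapped in purely for illustration), and fnv1a, shardFor, countPerShard as well as the userId and country attributes are made-up names. It only shows the general effect that a key with few distinct values piles documents onto few shards.

```javascript
// Simple 32-bit FNV-1a string hash. Illustration only: ArangoDB uses its
// own internal hashing, so real shard assignments will differ.
function fnv1a(text) {
  var hash = 0x811c9dc5;
  for (var i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

// Map a shard key value to a shard index in [0, numberOfShards).
function shardFor(keyValue, numberOfShards) {
  return fnv1a(String(keyValue)) % numberOfShards;
}

// Count how many documents would land on each simulated shard.
function countPerShard(docs, shardKey, numberOfShards) {
  var counts = [];
  for (var s = 0; s < numberOfShards; s++) { counts.push(0); }
  docs.forEach(function (doc) {
    counts[shardFor(doc[shardKey], numberOfShards)] += 1;
  });
  return counts;
}

// Hypothetical sample data: unique userId, identical country everywhere.
var docs = [];
for (var i = 0; i < 1000; i++) {
  docs.push({ userId: "user" + i, country: "DE" });
}

var byUserId = countPerShard(docs, "userId", 4);
var byCountry = countPerShard(docs, "country", 4);
console.log("by userId:  " + JSON.stringify(byUserId));
console.log("by country: " + JSON.stringify(byCountry));
```

In this simulation the constant country value sends every document to the same shard, while the unique userId values spread over several shards. On a real cluster the corresponding collection would be created with something like db._create("users", {numberOfShards: 4, shardKeys: ["userId"]}), where the collection and attribute names are again only examples.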
When you're developing AQL queries with cluster use in mind, it's essential to use the explain command to inspect how the query is distributed across the cluster, and where filters can be deployed:
db._create("sharded", {numberOfShards: 2})
db._explain("FOR x IN sharded RETURN x")
Query string:
 FOR x IN sharded RETURN x

Execution plan:
 Id   NodeType                  Est.   Comment
  1   SingletonNode                1   * ROOT
  2   EnumerateCollectionNode      1   - FOR x IN sharded   /* full collection scan */
  6   RemoteNode                   1   - REMOTE
  7   GatherNode                   1   - GATHER
  3   ReturnNode                   1   - RETURN x

Indexes used:
 none

Optimization rules applied:
 Id   RuleName
  1   scatter-in-cluster
  2   remove-unnecessary-remote-scatter
In this simple query the RETURN and GATHER nodes are on the coordinator; the nodes above them in the plan, up to and including the REMOTE node, are deployed to the DB server.
In general, fewer REMOTE / SCATTER -> GATHER pairs mean less cluster communication. The closer FILTER nodes can be deployed to the *CollectionNodes, so that fewer documents have to be sent via the REMOTE nodes, the better the performance.