Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Does Cassandra VNodes trade performance?

I am using DataStax Cassandra 1.2.3 on a 6 node cluster each having quad-core 3GHz processor and 8GB RAM. Recently, I started to use the VNodes feature by setting the num_tokens to 256 first and then to 128. I observe a decline in performance [No.of write requests/sec] for the schema that I am using. I mostly have a normalized schema with a mix of wide tables & counter column families.

  1. Has anyone observed a decline in performance using the VNodes? Are there any known optimization techniques to better utilize VNodes?

  2. Is there an optimum value for num_tokens that can be derived for a given hardware configuration/node?

  3. Also, I see that the cluster is nearly balanced with one node taking a higher share of the load automatically although I have a homogeneous cluster. Prior to using VNodes I would manually balance the cluster for Murmer3Partitioner and the performance was good.

Thanks, VS

like image 761
vinay sudhakar Avatar asked Jun 13 '13 10:06

vinay sudhakar


People also ask

What are Vnodes in Cassandra?

Virtual nodes, known as Vnodes, distribute data across nodes at a finer granularity than can be easily achieved if calculated tokens are used. Vnodes simplify many tasks in Cassandra: Tokens are automatically calculated and assigned to each node.

How does Cassandra distribute data?

In Cassandra, data distribution and replication go together. Data is organized by table and identified by a primary key, which determines which node the data is stored on. Replicas are copies of rows. When data is first written, it is also referred to as a replica.

What is Num_tokens in Cassandra?

The num_tokens setting influences the way Cassandra allocates data amongst the nodes, how that data is retrieved, and how that data is moved between nodes. Under the hood Cassandra uses a partitioner to decide where data is stored in the cluster.

How does Cassandra determine which node in a ring receives which data?

The partitioner determines how data is distributed across the nodes in a Cassandra cluster. Basically, a partitioner is a hash function to determine the token value by hashing the partition key of a row's data. Then, this partition key token is used to determine and distribute the row data within the ring.


1 Answers

(This is a modified version of my post: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Why-so-many-vnodes-td7588267.html)

The number of tokens per node (let's call it T and the number of nodes N), 256, was chosen to give good load balancing for random token assignments for most cluster sizes. For small T, a random choice of initial tokens will in most cases give a poor distribution of data. The larger T is, the closer to uniform the distribution will be, with increasing probability.

Also, for small T, when a new node is added, it won't have many ranges to split so won't be able to take an even slice of the data.

For this reason T should be large. But if it is too large, there are too many slices to keep track of so performance will be hit. The function to find which keys live where becomes more expensive and operations that deal with individual vnodes e.g. repair become slow. (An extreme example is SELECT * LIMIT 1, which when there is no data has to scan each vnode in turn in search of a single row. This is O(NT) and for even quite small T takes seconds to complete.)

So 256 was chosen to be a reasonable balance. I don't think most users will find it too slow; users with extremely large clusters may need to increase it.

like image 175
Richard Avatar answered Oct 12 '22 06:10

Richard