I am using DataStax Cassandra 1.2.3 on a 6-node cluster, each node having a quad-core 3 GHz processor and 8 GB of RAM. Recently I started using the vnodes feature, setting num_tokens first to 256 and then to 128. I observe a decline in performance (number of write requests/sec) for the schema I am using. I mostly have a normalized schema with a mix of wide tables and counter column families.
Has anyone observed a decline in performance using the VNodes? Are there any known optimization techniques to better utilize VNodes?
Is there an optimum value for num_tokens that can be derived for a given hardware configuration/node?
Also, I see that the cluster is not evenly balanced: one node automatically takes a higher share of the load, even though the cluster is homogeneous. Before using vnodes I would manually balance the cluster for the Murmur3Partitioner, and performance was good.
Thanks, VS
Virtual nodes, known as vnodes, distribute data across nodes at a finer granularity than can easily be achieved with manually calculated tokens. Vnodes simplify many tasks in Cassandra: tokens are automatically calculated and assigned to each node.
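In cassandra.yaml, the difference between the two modes comes down to two settings (this excerpt is illustrative, not a complete configuration):

```yaml
# cassandra.yaml (illustrative excerpt)
# With vnodes: set num_tokens and leave initial_token blank; the node
# picks its tokens automatically at bootstrap.
num_tokens: 256
# initial_token:    # only used for single-token (non-vnode) setups
```

All nodes in a homogeneous cluster should use the same num_tokens value; the setting takes effect when a node first bootstraps.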
In Cassandra, data distribution and replication go together. Data is organized by table and identified by a primary key, which determines which node the data is stored on. Replicas are copies of rows. When data is first written, it is also referred to as a replica.
The num_tokens setting influences the way Cassandra allocates data amongst the nodes, how that data is retrieved, and how that data is moved between nodes. Under the hood Cassandra uses a partitioner to decide where data is stored in the cluster.
The partitioner determines how data is distributed across the nodes in a Cassandra cluster. Essentially, a partitioner is a hash function that computes a token from the partition key of a row. That token then determines which node in the ring stores the row's data.
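The placement logic above can be sketched as a toy token ring. This is a minimal illustration, not Cassandra's implementation: the hash here is an MD5-based stand-in for Murmur3, and all node/key names are made up.

```python
import bisect
import hashlib

# Hypothetical stand-in for Murmur3: derive a signed 64-bit token from the
# partition key. (Cassandra's Murmur3Partitioner uses MurmurHash3 instead.)
def token_for(partition_key: str) -> int:
    digest = hashlib.md5(partition_key.encode()).digest()
    return int.from_bytes(digest[:8], "big", signed=True)

class TokenRing:
    """Toy token ring: each node owns one or more tokens; a row belongs to
    the node owning the first token >= the row's token, wrapping around."""

    def __init__(self, node_tokens):
        # node_tokens: {node_name: [token, ...]}; with vnodes each node
        # contributes num_tokens entries instead of just one.
        self._ring = sorted(
            (t, node) for node, tokens in node_tokens.items() for t in tokens
        )
        self._tokens = [t for t, _ in self._ring]

    def node_for_token(self, token: int) -> str:
        idx = bisect.bisect_left(self._tokens, token) % len(self._tokens)
        return self._ring[idx][1]

    def node_for(self, partition_key: str) -> str:
        return self.node_for_token(token_for(partition_key))

# Two nodes, two tokens each (num_tokens=2 in miniature).
ring = TokenRing({"node1": [-2**60, 2**60], "node2": [0, 2**62]})
```

With vnodes, each physical node simply appears many times on this ring, so its ownership is scattered into many small ranges instead of one contiguous arc.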
(This is a modified version of my post: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Why-so-many-vnodes-td7588267.html)
The default number of tokens per node (call it T; call the number of nodes N) is 256, chosen to give good load balancing for random token assignments at most cluster sizes. For small T, a random choice of initial tokens will in most cases give a poor distribution of data. The larger T is, the closer to uniform the distribution will be, with increasing probability.
Also, for small T, when a new node is added there aren't many existing ranges to split, so the new node can't take an even slice of the data.
For this reason T should be large. But if it is too large, there are too many ranges to keep track of and performance suffers. The function that finds which keys live where becomes more expensive, and operations that deal with individual vnodes, e.g. repair, become slow. (An extreme example is SELECT * LIMIT 1, which, when there is no data, has to scan each vnode in turn in search of a single row. This is O(NT) and even for fairly small T takes seconds to complete.)
So 256 was chosen to be a reasonable balance. I don't think most users will find it too slow; users with extremely large clusters may need to increase it.
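The load-balancing effect of T can be checked with a small simulation (this is my own illustrative sketch, not anything from Cassandra's codebase): assign random tokens, compute each node's share of the ring, and compare the most-loaded node against the ideal 1/N share.

```python
import random

def avg_max_ownership(num_nodes: int, tokens_per_node: int, trials: int = 20) -> float:
    """Assign random tokens, compute each node's share of the ring, and
    return (max share) / (fair share), averaged over trials. 1.0 is perfect."""
    ring_size = 2**64
    total = 0.0
    for _ in range(trials):
        # One (token, node) pair per vnode, sorted into ring order.
        tokens = sorted(
            (random.randrange(ring_size), node)
            for node in range(num_nodes)
            for _ in range(tokens_per_node)
        )
        owned = [0] * num_nodes
        prev = tokens[-1][0] - ring_size  # the last range wraps around
        for tok, node in tokens:
            owned[node] += tok - prev  # node owns the range (prev, tok]
            prev = tok
        total += max(owned) / (ring_size / num_nodes)
    return total / trials
```

On a 6-node cluster, T=1 typically leaves the busiest node with well over its fair share, while T=256 keeps ownership close to uniform, which matches the reasoning above.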