 

Is it possible to read data only from a single node in a Cassandra cluster with a replication factor of 3?

I know that Cassandra has different read consistency levels, but I haven't seen one that allows us to read data by key from only one node. I mean if we have a cluster with a replication factor of 3 then we will always ask all nodes when we read. Even if we choose a consistency level of ONE, we will ask all nodes but wait only for the first response from any node. That is why a read loads not just one node but 3 (4 with the coordinator node). I think we can't really improve read performance even if we set a bigger replication factor.

Is it really possible to read from only a single node?

Oleksandr asked Apr 08 '16 17:04



2 Answers

Are you using a Token-Aware Load Balancing Policy?

If you are, and you are querying with a consistency of LOCAL_ONE/ONE, a read query should only contact a single node.

Give the article Ideology and Testing of a Resilient Driver a read. In it, you'll notice that using the TokenAwarePolicy has this effect:

"For cases with a single datacenter, the TokenAwarePolicy chooses the primary replica to be the chosen coordinator in hopes of cutting down latency by avoiding the typical coordinator-replica hop."

So here's what happens. Let's say that I have a table for keeping track of Kerbalnauts, and I want to get all data for "Bill." I would use a query like this:

SELECT * FROM kerbalnauts WHERE name='Bill';

The driver hashes my partition key value (name) to the token of 4639906948852899531 (SELECT token(name) FROM kerbalnauts WHERE name='Bill'; returns that value). If I am working with a 6-node cluster, then my primary token ranges will look like this:

node   start range              end range
1)     9223372036854775808 to  -9223372036854775808
2)    -9223372036854775807 to  -5534023222112865485
3)    -5534023222112865484 to  -1844674407370955162
4)    -1844674407370955161 to   1844674407370955161
5)     1844674407370955162 to   5534023222112865484
6)     5534023222112865485 to   9223372036854775807

As node 5 is responsible for the token range containing the partition key "Bill," my query will be sent to node 5. As I am reading at a consistency of LOCAL_ONE, there is no need to contact another node, and the result is returned to the client having hit only a single node.
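The lookup described above can be sketched in a few lines of Python. This is only an illustration, not driver code: the range ends are copied from the table, the function names primary_node and replicas are hypothetical, and the replica placement assumes SimpleStrategy-style behaviour (the primary plus the next nodes clockwise on the ring).

```python
from bisect import bisect_left
from typing import List

# Inclusive end of each node's primary range, copied from the table above.
# Node 1 only owns the wrap-around slot at the minimum token.
RANGE_ENDS = [
    (-9223372036854775808, 1),
    (-5534023222112865485, 2),
    (-1844674407370955162, 3),
    (1844674407370955161, 4),
    (5534023222112865484, 5),
    (9223372036854775807, 6),
]
ENDS = [end for end, _ in RANGE_ENDS]


def primary_node(token: int) -> int:
    """Return the node whose primary range contains the token."""
    index = bisect_left(ENDS, token)  # first range end that is >= token
    return RANGE_ENDS[index][1]


def replicas(token: int, replication_factor: int = 3) -> List[int]:
    """Primary node plus the next nodes clockwise (assumed SimpleStrategy-style placement)."""
    start = bisect_left(ENDS, token)
    return [RANGE_ENDS[(start + i) % len(RANGE_ENDS)][1]
            for i in range(replication_factor)]


bill_token = 4639906948852899531  # token(name) for 'Bill', as quoted above
print(primary_node(bill_token))   # 5
print(replicas(bill_token))       # [5, 6, 1]
```

A token-aware client would send the query straight to node 5. With a replication factor of 3, nodes 6 and 1 also hold the row, but at LOCAL_ONE they do not need to be asked.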

Note: Token ranges computed with:

python3 -c 'print([str(((2**64 // 5) * i) - 2**63) for i in range(6)])'

Aaron answered Nov 02 '22 04:11


"I mean if we have a cluster with a replication factor of 3 then we will always ask all nodes when we read"

That is not correct. With consistency level ONE, the coordinator asks only the fastest replica (the one with the lowest latency) for data.

How does it know which replica is the fastest? By keeping internal latency statistics for each node.

With consistency level QUORUM or higher, the coordinator asks the fastest replica for the data and also asks other replicas for a digest.
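As a rough illustration of that selection, here is a simplified Python model. It is not Cassandra's actual implementation: the function names, node names, and latency figures are made up, and it ignores details such as speculative retries and read repair.

```python
from typing import Dict, List, Tuple


def replicas_required(consistency: str, replication_factor: int) -> int:
    """Number of replicas that must answer for the given consistency level."""
    if consistency == "ONE":
        return 1
    if consistency == "QUORUM":
        return replication_factor // 2 + 1
    if consistency == "ALL":
        return replication_factor
    raise ValueError(f"unsupported consistency level: {consistency}")


def plan_read(latency_ms: Dict[str, float], consistency: str) -> Tuple[str, List[str]]:
    """Pick the replica asked for data and the replicas asked for a digest."""
    ranked = sorted(latency_ms, key=latency_ms.get)  # fastest replica first
    needed = replicas_required(consistency, len(ranked))
    return ranked[0], ranked[1:needed]


# Hypothetical latency statistics for the three replicas of one partition.
latencies = {"node5": 1.2, "node6": 0.8, "node1": 2.5}
print(plan_read(latencies, "ONE"))     # ('node6', [])
print(plan_read(latencies, "QUORUM"))  # ('node6', ['node5'])
```

In this model a read at ONE touches one replica, and a read at QUORUM with a replication factor of 3 touches two: one for the data and one for a digest.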

From the client side, if you choose an appropriate load balancing policy (e.g. TokenAwarePolicy), the client will always contact the primary replica when using consistency level ONE.

doanduyhai answered Nov 02 '22 03:11