cassandra sharding and replication

Tags:

I am new to Cassandra was going though this Article explaining sharding and replication and I am stuck at a point that is -

I have a cluster with 6 Cassandra nodes configured at my local machine. I create a new keyspace "TestKeySpace" with replication factor as 6 and a table in keyspace "employee" and primary key is auto-increment-number named RID. I am not able to understand how this data will be partitioned and replicated. What I want to know is since I am keeping my replication factor to be 6, and data will be distributed on multiple nodes, then will each node will be having exactly same data as the other nodes or not?

What If my cluster has following configuration -

    Number of nodes - 6 (n1, n2 ,n3, n4, n5 and n6).
    replication_factor - 3.

How can I determine that for any one node (let say n1), on which other two nodes the data is replicated and which other nodes are behaving as different shards.

Thanks in Advance.

Regards, Vibhav

PS - If anybody down votes this question kindly do mention in comments what went wrong.

579

asked Jan 11 '17 10:01

Vibhav Singh Rohilla

1 Answers

I will explain this with simple example. A keyspace in cassandra is equivalent to database schema name in RDBMS.

First create a keyspace -

CREATE KEYSPACE MYKEYSPACE WITH REPLICATION = { 
 'class' : 'SimpleStrategy', 
 'replication_factor' : 3 
};

Lets create a simple table -

CREATE TABLE USER_BY_USERID(
 userid int,
 name text,
 email text,
 PRIMARY KEY(userid, name)
) WITH CLUSTERING ORDER BY(name  DESC);

In this example, userid is your partition key and name is clustering key. Partition is also called row key, this key determines on which node row will be saved.

Your first question -

I am not able to understand how this data will be partitioned?

Data will be partitioned based on your partition key. By default C* uses Murmur3partitioner. You can change the partitioner in cassandra.yaml configuration file. How partitions happens, is also depends on your configuration. You can specify range of tokens for each node, for example take a look at below cassandra.yaml configuration file. I have specified 6 node form your question.

cassandra.yaml for Node 0:

cluster_name: 'MyCluster'
initial_token: 0
seed_provider:
    - seeds:  "198.211.xxx.0"
listen_address: 198.211.xxx.0
rpc_address: 0.0.0.0
endpoint_snitch: RackInferringSnitch

cassandra.yaml for Node 1:

cluster_name: 'MyCluster'
initial_token: 3074457345618258602
seed_provider:
    - seeds:  "198.211.xxx.0"
listen_address: 192.241.xxx.0
rpc_address: 0.0.0.0
endpoint_snitch: RackInferringSnitch

cassandra.yaml for Node 2:

cluster_name: 'MyCluster'
initial_token: 6148914691236517205
seed_provider:
    - seeds:  "198.211.xxx.0"
listen_address: 37.139.xxx.0
rpc_address: 0.0.0.0
endpoint_snitch: RackInferringSnitch

.......Node3 ...... Node4 ....

cassandra.yaml for Node 5:

cluster_name: 'MyCluster'
initial_token: {some large number}
seed_provider:
    - seeds:  "198.211.xxx.0"
listen_address: 37.139.xxx.0
rpc_address: 0.0.0.0
endpoint_snitch: RackInferringSnitch

lets take this insert statement -

INSERT INTO USER_BY_USERID VALUES(
 1,
 "Darth Veder",
 "[email protected]"
);

Partitioner will calculate the hash of the PARTITION key (in above example userid - 1), and decides which node this row will be saved. Lets say calculated hash is something 12345, this row will be saved at Node 0 (look for the initial_token value for Node0 in above configuration).

Complete cassandra.yaml configuration configCassandra_yaml_r

You can go through this deployCalcTokens to know how to generate tokens.

Second question -

how data gets replicated?

Depending on your replication strategy and replication factor, the data gets replicated on each node. you have to specify Replication factor and replication strategy while creating keyspace. For example, in above example, I have used SimpleStrategy as replication strategy. This strategy is suitable for small cluster. For geologically distributed application you can use NetworkTopologyStrategy. replication_factor specifies, how many copies of a row to be created, in this example three copies of each row will be created. With simple strategy, cassandra will use clockwise direction to copy the row.

In above example, the row is saved at Node0 and the same node gets copied on Node1 and Node2. Let's take another example -

INSERT INTO USER_BY_USERID VALUES(
 448454,
 "Obi wan kenobi",
 "[email protected]"
);

For user id 448454, the calculated hash is say 3074457345618258609, so this row will be save at Node2 (look for the initial_token value for node 2 in above configuration) and also get copied in clockwise direction to Node3 and Node4 (remember we have specified replication factor of 3, so only three copies Noe2, Node3, Node4).

Hope this helps.

173

answered Sep 20 '22 10:09

Gunwant

Related questions
                            
                                setting up cassandra multi node cluster on a single ubuntu server
                            
                                How do you insert custom timeuuid's to cassandra without the now() function?
                            
                                Read/Write Strategy For Consistency Level
                            
                                Is a cassandra session thread safe? (using cpp driver)
                            
                                Is there an elegant way to perform a JSON update via CQL (Cassandra)?
                            
                                How cassandra replicates data
                            
                                What is the difference between a secondary index and an inverted index in Cassandra?
                            
                                Meteor.js possible with Cassandra instead of MongDB? [closed]
                            
                                Datastax Cassandra without root
                            
                                bash: jstat: command not found
                            
                                Cassandra node almost out of space, but nodetool cleanup is increasing disk use?
                            
                                Cassandra 3.0 updated SSTable format
                            
                                Cassandra NoHostAvailable: error in CQLSH
                            
                                Cassandra failed to connect
                            
                                Storing media files in Cassandra
                            
                                How to store array of objects in Cassandra
                            
                                Cassandra load balancing with an ordered partitioner?
                            
                                Which PHP client library to use with Cassandra?
                            
                                DataStax Devcenter fails to connect to the remote cassandra database
                            
                                batch size of prepared statement in spring data cassandra

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

cassandra sharding and replication

Tags:

cassandra

replication

sharding

database-replication

Vibhav Singh Rohilla

People also ask

1 Answers

Gunwant

Recent Activity

Donate For Us