In the DataStax introductory Cassandra course, they say that the clocks of all nodes in a Cassandra cluster have to be synchronized in order to prevent READ queries from returning 'old' data. If one or more nodes are down they cannot receive updates, but as soon as they come back up they will catch up, so there should be no problem... So why does a Cassandra cluster need synchronized clocks between nodes?

Why does a Cassandra cluster need synchronized clocks between nodes?

1 Answer

In general it is always a good idea to keep your server clocks in sync, but the primary reason clock sync is needed between nodes is that Cassandra uses a concept called 'Last Write Wins' to resolve conflicts and determine which mutation represents the most up-to-date state of the data. This is explained in 'Why Cassandra doesn't need vector clocks'.

Whenever you 'mutate' (write or delete) column(s) in Cassandra, a timestamp is assigned by the coordinator handling your request. That timestamp is written with the column value in a cell.

When a read request occurs, Cassandra builds your results by finding the mutations that match your query criteria. When it sees multiple cells representing the same column, it picks the one with the most recent timestamp (the read path is more involved than this, but that is all you need to know in this context).

Things start to become problematic when your nodes' clocks drift out of sync. As I mentioned, the coordinator node handling your request assigns the timestamp. If you make multiple mutations to the same column and different coordinators handle them, you can end up in situations where a write that happened earlier is returned instead of the most recent one.
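To make the 'pick the most recent timestamp' rule concrete, here is a minimal Python sketch. It is not Cassandra's actual read path: the `Cell` class and `resolve` function are hypothetical names made up for illustration, a value of `None` stands in for a delete marker, and ties between equal timestamps are ignored.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Cell:
    """Hypothetical stand-in for one stored version of a column."""
    value: Optional[str]  # None stands in for a delete marker (tombstone)
    timestamp: int        # write timestamp assigned to the mutation


def resolve(cells: Sequence[Cell]) -> Optional[str]:
    """Last Write Wins: return the value of the cell with the highest timestamp."""
    if not cells:
        return None
    newest = max(cells, key=lambda cell: cell.timestamp)
    return newest.value


# Three versions of the same column; arrival order does not matter.
versions = [Cell("old", 100), Cell("new", 200), Cell("stale", 150)]
print(resolve(versions))  # new
```

Note that nothing here looks at when a mutation actually arrived. Only the timestamp attached to it counts, which is why the source of that timestamp matters so much.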

Here is a basic scenario that illustrates this:

Assume we have a 2-node cluster with nodes A and B. Let's assume an initial state where A's clock is at time t10 and B's clock is at time t5.

1. A user executes DELETE C FROM tbl WHERE key=5. Node A coordinates the request, and it is assigned timestamp t10.
2. A second passes and a user executes UPDATE tbl SET C='data' WHERE key=5. Node B coordinates the request, and it is assigned timestamp t6.
3. A user executes the query SELECT C FROM tbl WHERE key=5. Because the DELETE from Step 1 has a more recent timestamp (t10 > t6), no results are returned.
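The scenario above can be replayed with a small Python simulation. This is only a sketch under simplifying assumptions: `Node`, `coordinate`, and `read` are hypothetical helpers, a node's clock is modeled as the true time plus a fixed offset, and a single shared list stands in for the replicas.

```python
TOMBSTONE = None  # stand-in for a delete marker


class Node:
    """Hypothetical coordinator whose clock is skewed by a fixed offset."""

    def __init__(self, name, clock_offset):
        self.name = name
        self.clock_offset = clock_offset

    def now(self, true_time):
        # What this node's clock reads at the given true time.
        return true_time + self.clock_offset


def coordinate(node, true_time, value, replica_cells):
    """Stamp the mutation with the coordinator's clock and store it."""
    timestamp = node.now(true_time)
    replica_cells.append((timestamp, value))
    return timestamp


def read(replica_cells):
    """Last Write Wins: return the value with the highest timestamp."""
    if not replica_cells:
        return None
    return max(replica_cells, key=lambda cell: cell[0])[1]


node_a = Node("A", clock_offset=10)  # reads t10 at true time 0
node_b = Node("B", clock_offset=5)   # reads t5 at true time 0

cells = []
delete_ts = coordinate(node_a, 0, TOMBSTONE, cells)  # step 1: DELETE via A
update_ts = coordinate(node_b, 1, "data", cells)     # step 2: UPDATE via B, one second later
result = read(cells)                                 # step 3: SELECT

print(delete_ts, update_ts, result)  # 10 6 None
```

With both offsets set to the same value, the UPDATE would carry the higher timestamp and the read would return 'data', which is the behavior the user expects.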

Note that newer versions of the DataStax drivers are starting to default to client timestamps, where your client application generates and assigns timestamps to requests instead of relying on the C* nodes to assign them. The DataStax java-driver defaults to client timestamps as of 3.0 (read more about that in 'Client-side generation'). This is very nice if all requests come from the same client; however, if you have multiple applications writing to Cassandra, you now have to worry about keeping your client clocks in sync.
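As a rough illustration of client-side generation, the Python sketch below produces strictly increasing microsecond timestamps within a single client process. `MonotonicTimestamps` is a hypothetical class written for this example, not a driver API. The CQL string uses the `USING TIMESTAMP` clause only to show where such a value ends up; a driver with client timestamps enabled attaches the timestamp to the request for you.

```python
import threading
import time


class MonotonicTimestamps:
    """Hypothetical generator of strictly increasing microsecond timestamps."""

    def __init__(self, clock=time.time):
        self._clock = clock          # injectable so the behavior can be tested
        self._last = 0
        self._lock = threading.Lock()

    def next(self):
        with self._lock:
            now = int(self._clock() * 1_000_000)
            # Never go backwards or repeat, even if the system clock does.
            self._last = max(now, self._last + 1)
            return self._last


gen = MonotonicTimestamps()
ts = gen.next()

# Illustrative only: the same table and column as in the scenario above.
statement = f"UPDATE tbl USING TIMESTAMP {ts} SET C='data' WHERE key=5"
print(statement)
```

This keeps ordering consistent within one client even if its clock steps backwards. It does nothing for ordering between different clients, which still depends on their clocks agreeing.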

answered Sep 21 '22 08:09

Andy Tolbert

Tags:

cassandra

Reshef
