I've got a Cassandra cluster (3 nodes, all nodes deployed to AWS) that I am trying to migrate over to a DataStax cluster. It's simply time to stop managing these nodes myself.
I have multiple producers and consumers all reading/writing data, all day long, to my Cassandra cluster. I don't have the option of putting an app/service/proxy in front of my Cassandra cluster, and then just flipping the switch cleanly so that all reads/writes go to/from my Cassandra, over to DataStax. So there's no clean way to migrate the tables one at a time. I'm also trying to achieve zero (or near zero) downtime for all producers/consumers of the data. One hard requirement: the migration cannot be lossy. No lost data!
I'm thinking the best strategy here is a four-step process:
This solution is the most minimally-invasive, closest-to-zero-downtime solution I can come up with, but assumes a few things:
I guess I'm wondering if this strategy is: (1) doable/feasible, and (2) optimal; and if there are any features/tools in the Cassandra/DataStax ecosystem that I could leverage to make this any better (faster and with zero downtime).
The four steps you've outlined are definitely a viable option. There's also the route of doing a simple rolling binary install: https://docs.datastax.com/en/latest-upgrade/upgrade/datastax_enterprise/upgrdCstarToDSE.html
I'll speak in the context of the steps you provided above. If you're curious about the rolling binary install, we can definitely chat about that as well.
Note: the doc links below are specific to Cassandra 3.0 (DSE 5.0) - make sure the doc versions match your Cassandra version.
If the major Cassandra version of your current cluster matches the major Cassandra version bundled with DataStax, you should be able to add the DataStax nodes as a new DC in the same cluster your current Cassandra environment belongs to, following: http://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsAddDCToCluster.html - That will bring the existing data from the existing Cassandra DC into the DataStax DC.
If you're mismatching Cassandra versions (current Cassandra is older/newer than DataStax Cassandra), then you may want to reach out to DataStax via https://academy.datastax.com/slack as the process will be more specific to your environment and can vary greatly.
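For reference, the add-DC procedure from the doc link above boils down to roughly the following on each new node (the DC/rack names here are illustrative, and this is only a sketch - follow the doc for your exact version):

```shell
# Rough sketch of adding the new nodes as their own datacenter.
# On each new DataStax node, BEFORE starting it:

# 1. cassandra-rackdc.properties: give the new nodes their own DC name
#      dc=DataStaxDC
#      rack=RACK1

# 2. cassandra.yaml: use the SAME cluster_name as the existing cluster,
#    point seeds at existing seed nodes, and skip bootstrap streaming
#    (you'll stream data later with nodetool rebuild):
#      auto_bootstrap: false

# 3. Start the node, then confirm it joined under the new DC:
nodetool status
```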
As outlined in the docs, you'll want to run
ALTER KEYSPACE "your-keyspace" WITH REPLICATION =
{'class': 'NetworkTopologyStrategy', 'OldCassandraDC': 3, 'DataStaxDC': 3};
(obviously changing DC name and replication factor to your specs)
This will make sure new data from your producers will replicate to the new DataStax nodes.
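A couple of quick sanity checks after the ALTER can confirm both DCs are in play (keyspace name is a placeholder):

```shell
# The replication map should now list both DCs with their RFs:
cqlsh -e "DESCRIBE KEYSPACE your_keyspace"

# Passing the keyspace to nodetool status shows effective ownership;
# the new DC's nodes should appear with a non-zero "Owns" percentage:
nodetool status your_keyspace
```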
You can then run nodetool rebuild -- name_of_existing_data_center
on each of the DataStax nodes to stream data over from the existing Cassandra nodes. Depending on how much data there is, it may be somewhat time-consuming, but it's the easiest, most hands-off way to do it.
You would then want to update the contact points in your producers/consumers one by one before decommissioning the old Cassandra DC.
A few tips from my experience: when you kick off nodetool rebuild, run it inside a screen session so you can see when it completes (or errors out). Otherwise, you'd have to monitor progress by running nodetool netstats and checking the streaming activity. Hope that helps!
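The screen/netstats tip above looks roughly like this in practice (DC name is illustrative):

```shell
# Kick off the rebuild in a detached screen session so it survives
# an SSH disconnect (run on each new DataStax node):
screen -dmS rebuild nodetool rebuild -- OldCassandraDC

# Re-attach any time to see whether it finished or errored:
#   screen -r rebuild

# Or watch streaming progress from outside the session; filtering out
# completed transfers makes the in-flight ones easier to spot:
nodetool netstats | grep -v "100%"
```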
I presume you mean the DataStax managed product, where they run Cassandra for you. If you just mean "run DSE on your own AWS instances", you can do a binary upgrade in place.
The questions you asked are best asked of DataStax - if you're going to pay them, you may as well ask them questions (that's what customers do).
Your 4-step approach is mostly logical, but probably overly complex. Most Cassandra drivers will auto-discover new hosts and auto-evict old/leaving hosts, so once you have all the new DataStax Managed nodes in the cluster (assuming they allow that), you can run repair to guarantee consistency, then decommission your existing nodes - your app will keep working (isn't Cassandra great?). You'll want to update your app config/endpoints to point at the new DataStax Managed nodes, but that doesn't need to be done in advance.
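The repair-then-decommission sequence is straightforward; something like the following, run node by node:

```shell
# Once the new nodes are in the cluster, make their replicas consistent.
# Run on each node (-pr repairs only that node's primary ranges, so
# repairing every node covers the whole ring without duplicate work):
nodetool repair -pr

# Then, on each OLD node, one at a time (this streams the node's data
# to the remaining replicas before it leaves the ring):
nodetool decommission
```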
The one caveat here is the latency involved - going from your environment to DataStax Managed may introduce latency. In that case, there's an intermediate step you can consider: add the DataStax Managed nodes as a different "datacenter" within Cassandra, expand the replication factor, and use LOCAL_* consistency levels (e.g. LOCAL_QUORUM) to control which DC gets the queries (and then you CAN move your producers/consumers over individually).