
Best practice cassandra setup on ec2 with large amount of data

I am doing a large migration from physical machines to ec2 instances.

As of right now I have 3 x.large nodes, each with 4 instance store drives (RAID-0, 1.6TB). After I set this up I remembered that "The data on an instance store volume persists only during the life of the associated Amazon EC2 instance; if you stop or terminate an instance, any data on instance store volumes is lost."

What do people usually do in this situation? I am worried that if one of the boxes crashes, all of the data on that box will be lost unless it is 100% replicated on another.

http://www.hulen.com/?p=326 According to the link above, these guys use ephemeral drives and periodically back up the contents using EBS drives and snapshots.

In this question, "How do I take a backup of aws ec2 instance/ephemeral storage?", people claim that you cannot back up ephemeral data onto EBS snapshots.

Is my best choice to use a few EBS drives, RAID-0 them together, and take snapshots directly from them? I know this is probably the most expensive solution; however, it seems to make the most sense.

Any info would be great.

Thank you for your time.

asked Jan 27 '14 at 16:01 by John Z



1 Answer

I have been running Cassandra on EC2 for over 2 years. To address your concerns, you need to build a proper availability architecture on EC2 for your Cassandra cluster. Here is a list of points to consider:

  1. Spread your cluster across at least 3 availability zones;
  2. Use NetworkTopologyStrategy with Ec2Snitch/Ec2MultiRegionSnitch to propagate a replica of your data to each zone; this means that the machines in each zone, taken together, will hold your full data set; for example, the strategy_options would be like {us-east:3}.
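
As a rough sketch of what point 2 looks like in practice, the snippet below builds a CQL 3 `CREATE KEYSPACE` statement with NetworkTopologyStrategy. The helper name `create_keyspace_cql` and the keyspace name `my_keyspace` are made up for illustration; the data center name `us-east` is the one from the example above (with Ec2Snitch the region acts as the data center and each availability zone as a rack).

```python
from typing import Mapping


def create_keyspace_cql(keyspace: str, replicas_per_dc: Mapping[str, int]) -> str:
    """Build a CREATE KEYSPACE statement using NetworkTopologyStrategy.

    `replicas_per_dc` maps a data center name (the region name when using
    Ec2Snitch, e.g. "us-east") to the number of replicas kept there.
    """
    if not replicas_per_dc:
        raise ValueError("at least one data center is required")
    # Render each data center as a 'name': replica_count pair.
    options = ", ".join(
        f"'{dc}': {rf}" for dc, rf in sorted(replicas_per_dc.items())
    )
    return (
        f"CREATE KEYSPACE {keyspace} WITH replication = "
        f"{{'class': 'NetworkTopologyStrategy', {options}}};"
    )


# Hypothetical keyspace name; 3 replicas in us-east, one per zone
# when the cluster spans 3 availability zones.
print(create_keyspace_cql("my_keyspace", {"us-east": 3}))
```

The generated statement can be pasted into cqlsh. Adding a second region later would mean adding another entry to the mapping.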

The above two tips should satisfy basic availability in AWS, and if your queries are sent using LOCAL_QUORUM, your application will be fine even if one zone goes down.
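
To see why LOCAL_QUORUM survives the loss of one zone, here is a small worked calculation. It is a simplified model, not Cassandra code: it assumes the replicas are spread as evenly as possible across zones (one per zone when the replication factor equals the number of zones), and the function names are made up for illustration.

```python
def quorum(replication_factor: int) -> int:
    """Replicas that must respond for a quorum: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def replicas_alive(replication_factor: int, zones: int, zones_down: int) -> int:
    """Worst-case number of replicas left after `zones_down` zones fail.

    Assumption: replicas are spread as evenly as possible across zones.
    """
    per_zone = [
        replication_factor // zones + (1 if i < replication_factor % zones else 0)
        for i in range(zones)
    ]
    # Worst case: the zones holding the most replicas are the ones that fail.
    per_zone.sort(reverse=True)
    return sum(per_zone[zones_down:])


def local_quorum_available(replication_factor: int, zones: int, zones_down: int) -> bool:
    """True if enough replicas survive to satisfy LOCAL_QUORUM."""
    return replicas_alive(replication_factor, zones, zones_down) >= quorum(replication_factor)


# RF 3 over 3 zones: quorum is 2, so losing one zone still leaves 2 replicas.
print(local_quorum_available(3, 3, zones_down=1))  # True
# Losing two zones leaves 1 replica, which is below the quorum of 2.
print(local_quorum_available(3, 3, zones_down=2))  # False
```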

If you are concerned about 2 zones going down (I don't recall that happening in AWS in my past 2 years of use), then you can also add another region to your cluster.

With the above, if any node dies for any reason, you can restore it from nodes in other zones. After all, Cassandra was designed to provide you with this kind of availability.

About EBS vs Ephemeral:

I have always been against using EBS volumes for anything in production because EBS is one of the worst AWS services in terms of availability. The volumes go down several times a year, and their outages usually cascade to other AWS services like ELB and RDS. They are also network-attached storage, so every read/write has to go over the network. Don't use them. Even DataStax doesn't recommend them:

http://www.datastax.com/documentation/cassandra/1.2/webhelp/index.html#cassandra/architecture/../../cassandra/architecture/architecturePlanningEC2_c.html

About Backups:

I use a solution called Priam (https://github.com/Netflix/Priam), which was written by Netflix. It can take a nightly snapshot of your cluster and copy everything to S3. If you enable incremental_backups, it also uploads incremental backups to S3. In case a node goes down, you can trigger a restore on the specific node using a simple API call. It restores a lot faster and does not put a lot of streaming load on your other nodes. I also added a patch to it which lets you do fancy things like bringing up multiple DCs inside one AWS region.

You can read about my setup here: http://aryanet.com/blog/shrinking-the-cassandra-cluster-to-fewer-nodes

Hope the above helps.

answered Oct 18 '22 at 11:10 by Arya