
Cassandra Stress Test results evaluation

I have been using the cassandra-stress tool to evaluate my Cassandra cluster for quite some time now.

My problem is that I am not able to comprehend the results generated for my specific use case.

My schema looks something like this:

CREATE TABLE Table_test(
      ID uuid,
      Time timestamp,
      Value double,
      Date timestamp,
      PRIMARY KEY ((ID,Date), Time)
) WITH COMPACT STORAGE;

I have described this schema in a custom YAML profile and run with the parameters n=10000 and threads=100, leaving the rest at their default options (cl=one, mode=native cql3, etc.). The Cassandra cluster is a 3-node CentOS VM setup.

A few specifics of the custom YAML file are as follows:

insert:
    partitions: fixed(100)
    select: fixed(1)/2
    batchtype: UNLOGGED

columnspec:
    - name: Time
      size: fixed(1000)
    - name: ID
      size: uniform(1..100)
    - name: Date
      size: uniform(1..10)
    - name: Value
      size: uniform(-100..100)
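
For reference, these fragments sit inside a full cassandra-stress user profile; a skeleton for the table above would look roughly like the sketch below (the keyspace name, replication settings, profile file name and node address are placeholders, not my exact values):

# Assumed invocation (profile file name and node IP are placeholders):
#   cassandra-stress user profile=table_test.yaml "ops(insert=1)" n=10000 cl=one -mode native cql3 -rate threads=100 -node <node-ip>

keyspace: stress_ks
keyspace_definition: |
    CREATE KEYSPACE stress_ks
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

table: Table_test
table_definition: |
    CREATE TABLE Table_test (
        ID uuid,
        Time timestamp,
        Value double,
        Date timestamp,
        PRIMARY KEY ((ID, Date), Time)
    ) WITH COMPACT STORAGE;

columnspec:
    - name: Time
      size: fixed(1000)
    - name: ID
      size: uniform(1..100)
    - name: Date
      size: uniform(1..10)
    - name: Value
      size: uniform(-100..100)

insert:
    partitions: fixed(100)
    select: fixed(1)/2
    batchtype: UNLOGGED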

My observations so far are as follows:

  1. With n=10000 and time: fixed(1000), the number of rows getting inserted is 10 million. (10000*1000=10000000)
  2. The number of row-keys/partitions is 10000 (i.e. n). Within each operation, 100 partitions are taken at a time (100 * 1000 = 100000 key-value pairs), out of which 50000 key-value pairs are processed per batch, because select: fixed(1)/2 means roughly 50%.

The output message also confirms the same:

Generating batches with [100..100] partitions and [50000..50000] rows (of [100000..100000] total rows in the partitions)

The results that I get are the following for consecutive runs with the same configuration as above:

Run   Total_ops   Op_rate (op/s)   Partition_rate (pk/s)   Row_rate (row/s)   Time (s)
1     56          19               1885                    943246             3.0
2     46          46               4648                    2325498            1.0
3     27          30               2982                    1489870            0.9
4     59          19               1932                    966034             3.1
5     100         17               1730                    865182             5.8

Now what I need to understand are as follows:

  1. Which of these metrics is the throughput, i.e. the number of records inserted per second? Is it the Row_rate, Op_rate or Partition_rate? If it's the Row_rate, can I safely conclude that I am able to insert close to 1 million records per second? Any thoughts on what the Op_rate and Partition_rate mean in this case?
  2. Why does the Total_ops vary so drastically in every run? Does the number of threads have anything to do with this variation? What can I conclude here about the stability of my Cassandra setup?
  3. How do I determine the batch size per thread here? In my example, is the batch size 50000?

Thanks in advance.



1 Answer

Row Rate is the number of CQL rows inserted into your database per second. For your table, a CQL row is a tuple like (ID uuid, Time timestamp, Value double, Date timestamp).

The Partition Rate is the number of partitions C* had to construct per second. A partition is the data structure which holds and orders data in Cassandra; data with the same partition key ends up on the same node. The partition rate is equal to the number of unique partition-key values inserted in the time window. For your table, this would be unique values of (ID, Date).

Op Rate is the number of actual CQL operations performed per second. With your settings it is running unlogged batches to insert the data, and each batch contains approximately 100 partitions (unique combinations of ID and Date), which is why Op Rate * 100 ~= Partition Rate. You can see this in your numbers: in run 1, 19 op/s * 100 ≈ 1885 pk/s, and 1885 pk/s * ~500 rows per partition (50000 rows per batch spread over 100 partitions) ≈ 943000 row/s, which matches the reported Row_rate.

Total OP includes all operations, read and write, so if you run any read operations those are counted here as well.

I would suggest changing your batch size to match your real workload, or keeping it at 1 partition, depending on your actual database usage. This should give you a more realistic scenario. It's also important to run for much longer than just 100 total operations to really get a sense of your system's capabilities: some of the biggest difficulties come when the size of the dataset increases beyond the amount of RAM in the machine.
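
For example, if your real workload writes one partition at a time rather than 100-partition batches, the insert section of your profile could look something like the sketch below (an illustration of the single-partition case, not a value taken from your setup):

insert:
    partitions: fixed(1)     # each operation now writes to a single partition
    select: fixed(1)/2       # still inserts half of that partition's generated rows per visit
    batchtype: UNLOGGED      # an unlogged batch confined to one partition is the cheap case

With partitions: fixed(1), the Op Rate and Partition Rate become essentially the same number, which makes the output much easier to interpret.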
