Cassandra read performance with collections

I have the following column family defined in Cassandra:

CREATE TABLE metric (
period int,
rollup int,
tenant text,
path text,
time bigint,
data list<double>,  -- per-row collection of rollup values
PRIMARY KEY ((tenant, period, rollup, path), time)  -- composite partition key, clustered by time
) WITH
bloom_filter_fp_chance=0.010000 AND
caching='KEYS_ONLY' AND
comment='' AND
dclocal_read_repair_chance=0.000000 AND
gc_grace_seconds=864000 AND
index_interval=128 AND
read_repair_chance=0.100000 AND
replicate_on_write='true' AND
populate_io_cache_on_flush='false' AND
default_time_to_live=0 AND
speculative_retry='NONE' AND
memtable_flush_period_in_ms=0 AND
compaction={'class': 'SizeTieredCompactionStrategy'} AND
compression={'sstable_compression': 'LZ4Compressor'};

Does the size of the data list affect read performance in Cassandra? If it does, how can we measure it?

The issue is that querying Data-Set 1 (8640 rows, each with 90 elements in its data list) for a given path/period/rollup combination takes longer than querying Data-Set 2 (8640 rows, each with 10 elements in its data list); the query shape is sketched below.
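For reference, a read of this shape has to name every partition-key column; a minimal sketch with hypothetical key values:

SELECT time, data
FROM metric
WHERE tenant = 'acme'              -- hypothetical values: all four
  AND period = 60                  -- partition-key columns (tenant,
  AND rollup = 300                 -- period, rollup, path) must be
  AND path = 'servers.web01.cpu';  -- supplied to select the series

Each of the 8640 rows returned carries its entire data list, so a Data-Set 1 query moves roughly nine times as many doubles per request as a Data-Set 2 query.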

Also, if I run a performance test with 10 users accessing Data-Set 1 simultaneously, I start seeing Cassandra timeouts in the backend and a lot of time spent in garbage collection; the same does not happen when I query Data-Set 2.

So I am concluding that the number of elements in the data list is affecting performance.

Are you seeing similar performance issues in your Cassandra stack?

asked Jun 17 '15 by Vikrant Sonone


1 Answer

I wouldn't think that 90 items in a collection would be that big of a deal, but in your case I guess it is. The problem is that when you query a collection column, Cassandra can't just return parts of the collection. It has to return the entire column (collection). That operation isn't free, but I wouldn't think that 90 doubles would be a big deal.

One thing to try is to turn tracing on. That should give you some idea of what Cassandra is doing when you are running your query.

aploetz@cqlsh:stackoverflow> tracing on;

Often, turning on tracing can lead you to the culprit.
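As a concrete sketch (key values are hypothetical, matching the query shape in the question), enable tracing and re-run the slow query; after the result set, cqlsh prints each internal step with its source node and elapsed microseconds, which shows where the time goes:

TRACING ON;  -- cqlsh session command, not CQL
SELECT time, data FROM metric
WHERE tenant = 'acme' AND period = 60
  AND rollup = 300 AND path = 'servers.web01.cpu';
TRACING OFF;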

it spends a lot of time in Garbage collection

Are you using any special JVM settings? How much RAM do you have on each node? GC that interrupts normal operations indicates (to me) that there might be an issue with your JVM heap settings. The DataStax doc on Tuning Java Resources indicates that you should use the following guidelines on sizing your heap, based on your node's RAM:

System Memory       Heap Size

Less than 2GB       1/2 of system memory
2GB to 4GB          1GB
Greater than 4GB    1/4 system memory, but not more than 8GB
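Both settings live in conf/cassandra-env.sh; they are normally calculated automatically, and the script expects them to be set in pairs (both or neither). A minimal sketch for a hypothetical node with 16GB of RAM and 8 cores:

# conf/cassandra-env.sh -- illustrative overrides for a hypothetical 16GB/8-core node
MAX_HEAP_SIZE="4G"   # 1/4 of system memory, per the table above (capped at 8GB)
HEAP_NEWSIZE="800M"  # rule of thumb: ~100MB per CPU core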
answered Sep 19 '22 by Aaron