Cassandra read performance with collections

I have the following column family defined in Cassandra:

CREATE TABLE metric (
period int,
rollup int,
tenant text,
path text,
time bigint,
data list<double>,  -- per-row collection of rollup values
PRIMARY KEY ((tenant, period, rollup, path), time)  -- composite partition key, clustered by time
) WITH
bloom_filter_fp_chance=0.010000 AND
caching='KEYS_ONLY' AND
comment='' AND
dclocal_read_repair_chance=0.000000 AND
gc_grace_seconds=864000 AND
index_interval=128 AND
read_repair_chance=0.100000 AND
replicate_on_write='true' AND
populate_io_cache_on_flush='false' AND
default_time_to_live=0 AND
speculative_retry='NONE' AND
memtable_flush_period_in_ms=0 AND
compaction={'class': 'SizeTieredCompactionStrategy'} AND
compression={'sstable_compression': 'LZ4Compressor'};

Does the size of the data list affect read performance in Cassandra? If it does, how can we measure it?

The issue is that querying Data-Set 1 (8640 rows, each with 90 elements in its data list) for a given path/period/rollup combination takes longer than querying Data-Set 2 (8640 rows, each with 10 elements in its data list); the query shape is sketched below.
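For reference, a read of this shape has to name every partition-key column; a minimal sketch with hypothetical key values:

SELECT time, data
FROM metric
WHERE tenant = 'acme'              -- hypothetical values: all four
  AND period = 60                  -- partition-key columns (tenant,
  AND rollup = 300                 -- period, rollup, path) must be
  AND path = 'servers.web01.cpu';  -- supplied to select the series

Each of the 8640 rows returned carries its entire data list, so a Data-Set 1 query moves roughly nine times as many doubles per request as a Data-Set 2 query.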

Also, if I run a performance test with 10 users accessing Data-Set 1 simultaneously, I start seeing Cassandra timeouts in the backend and a lot of time spent in garbage collection; the same does not happen when I query Data-Set 2.

So I am concluding that the number of elements in the data list is affecting performance.

Are you seeing similar performance issues in your Cassandra stack?

asked Jun 17 '15 by Vikrant Sonone


1 Answer

I wouldn't think that 90 items in a collection would be that big of a deal, but in your case I guess it is. The problem is that when you query a collection column, Cassandra can't just return parts of the collection. It has to return the entire column (collection). That operation isn't free, but I wouldn't think that 90 doubles would be a big deal.

One thing to try is to turn tracing on. That should give you some idea of what Cassandra is doing when you are running your query.

aploetz@cqlsh:stackoverflow> tracing on;

Often, turning on tracing can lead you to the culprit.
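As a concrete sketch (key values are hypothetical, matching the query shape in the question), enable tracing and re-run the slow query; after the result set, cqlsh prints each internal step with its source node and elapsed microseconds, which shows where the time goes:

TRACING ON;  -- cqlsh session command, not CQL
SELECT time, data FROM metric
WHERE tenant = 'acme' AND period = 60
  AND rollup = 300 AND path = 'servers.web01.cpu';
TRACING OFF;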

it spends a lot of time in Garbage collection

Are you using any special JVM settings? How much RAM do you have on each node? GC that interrupts normal operations indicates (to me) that there might be an issue with your JVM heap settings. The DataStax doc on Tuning Java Resources indicates that you should use the following guidelines on sizing your heap, based on your node's RAM:

System Memory       Heap Size

Less than 2GB       1/2 of system memory
2GB to 4GB          1GB
Greater than 4GB    1/4 system memory, but not more than 8GB
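Both settings live in conf/cassandra-env.sh; they are normally calculated automatically, and the script expects them to be set in pairs (both or neither). A minimal sketch for a hypothetical node with 16GB of RAM and 8 cores:

# conf/cassandra-env.sh -- illustrative overrides for a hypothetical 16GB/8-core node
MAX_HEAP_SIZE="4G"   # 1/4 of system memory, per the table above (capped at 8GB)
HEAP_NEWSIZE="800M"  # rule of thumb: ~100MB per CPU core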
answered Sep 19 '22 by Aaron