Cassandra Wide Vs Skinny Rows for large columns

Tags:

I need to insert 60GB of data into cassandra per day.

This breaks down into
100 sets of keys
150,000 keys per set
4KB of data per key

In terms of write performance am I better off using
1 row per set with 150,000 keys per row
10 rows per set with 15,000 keys per row
100 rows per set with 1,500 keys per row
1000 rows per set with 150 keys per row

Another variable to consider, my data expires after 24 hours so I am using TTL=86400 to automate expiration

More specific details about my configuration:

CREATE TABLE stuff (
  stuff_id text,
  stuff_column text,
  value blob,
  PRIMARY KEY (stuff_id, stuff_column)
) WITH COMPACT STORAGE AND
  bloom_filter_fp_chance=0.100000 AND
  caching='KEYS_ONLY' AND
  comment='' AND
  dclocal_read_repair_chance=0.000000 AND
  gc_grace_seconds=39600 AND
  read_repair_chance=0.100000 AND
  replicate_on_write='true' AND
  populate_io_cache_on_flush='false' AND
  compaction={'tombstone_compaction_interval': '43200', 'class': 'LeveledCompactionStrategy'} AND
  compression={'sstable_compression': 'SnappyCompressor'};

Access pattern details:
The 4KB value is a set of 1000 4 byte floats packed into a string.

A typical request is going to need a random selection of 20 - 60 of those floats.

Initially, those floats are all stored in the same logical row and column. A logical row here represents a set of data at a given time if it were all written to one row with 150,000 columns.

As time passes some of the data is updated, within a logical row within the set of columns, a random set of levels within the packed string will be updated. Instead of updating in place, the new levels are written to a new logical row combined with other new data to avoid rewriting all of the data which is still valid. This leads to fragmentation as multiple rows now need to be accessed to retrieve that set of 20 - 60 values. A request will now typically read from the same column across 1 - 5 different rows.

Test Method I wrote 5 samples of random data for each configuration and averaged the results. Rates were calculated as (Bytes_written / (time * 10^6)). Time was measured in seconds with millisecond precision. Pycassa was used as the Cassandra interface. The Pycassa batch insert operator was used. Each insert inserts multiple columns to a single row, insert sizes are limited to 12 MB. The queue is flushed at 12MB or less. Sizes do not account for row and column overhead, just data. The data source and data sink are on the same network on different systems.

Write results Keep in mind there are a number of other variables in play due to the complexity of the Cassandra configuration.
1 row 150,000 keys per row: 14 MBps
10 rows 15,000 keys per row: 15 MBps
100 rows 1,500 keys per row: 18 MBps
1000 rows 150 keys per row: 11 MBps

528

asked Sep 26 '13 21:09

cs_alumnus

1 Answers

The answer depends on what your data retrieval pattern is, and how your data is logically grouped. Broadly, here is what I think:

Wide row (1 row per set): This could be the best solution as it prevents the request from hitting several nodes at once, and with secondary indexing or composite column names, you can quickly filter data to your needs. This is best if you need to access one set of data per request. However, doing too many multigets on wide rows can increase memory pressure on nodes, and degrade performance.
Skinny row (1000 rows per set): On the other hand, a wide row can give rise to read hotspots in the cluster. This is especially true if you need to make a high volume of requests for a subset of data that exists entirely in one wide row. In such a case, a skinny row will distribute your requests more uniformly throughout the cluster, and avoid hotspots. Also, in my experience, "skinnier" rows tend to behave better with multigets.

I would suggest, analyze your data access pattern, and finalize your data model based on that, rather than the other way around.

answered Sep 30 '22 04:09

Nikhil

Related questions
                            
                                Should I declare a java field 'final' when if it's not modified in code?
                            
                                Why cgo's performance is so slow? is there something wrong with my testing code?
                            
                                React native android transitions are really slow
                            
                                Intelliji idea is very slow in debug mode and it is running perfectly in normal mode
                            
                                Why are double preferred over float? [closed]
                            
                                draw 10,000 objects on canvas javascript
                            
                                How to use Rcpp to speed up a for loop?
                            
                                Fastest way to separate the digits of an int into an array in .NET?
                            
                                When and why should you use NSUserDefaults's synchronize() method?
                            
                                How to compute volatility (standard deviation) in rolling window in Pandas
                            
                                C++ function parameters: use a reference or a pointer (and then dereference)?
                            
                                How to write code in Visual Studio faster? [closed]
                            
                                OpenCV: C++ and C performance comparison
                            
                                Unable to start uiautomatorviewer
                            
                                int, short, byte performance in back-to-back for-loops
                            
                                Most Efficient way to test large number of strings against a large List<string>
                            
                                fastest way to replace string in a template
                            
                                JavaScript, sorting with second parameter is faster
                            
                                Reasons for high CPU usage in SocketInputStream.socketRead0()
                            
                                Fastest HOG Feature Extraction implementation?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Cassandra Wide Vs Skinny Rows for large columns

Tags:

performance

schema

cassandra

cs_alumnus

People also ask

1 Answers

Nikhil

Recent Activity

Donate For Us