
Cassandra performance for long rows

Tags:

cassandra

I'm looking at implementing a column family (CF) in Cassandra that has very long rows (hundreds of thousands to millions of columns per row).

Using entirely dummy data, I've inserted 2 million columns into a single row (evenly spaced). When I do a slice operation to fetch 20 columns, I notice a massive performance degradation the further down the row the slice starts.

For most of the row I can serve up slice results in 10-40ms, but towards the end of the row performance hits a wall, with response times increasing from 43ms at the 1,800,000 mark to 214ms at 1,900,000 and 435ms at 1,999,900! (All slices are of equal width.)

I'm at a loss to explain this massive degradation in performance towards the end of the row. Can someone please provide some guidance as to what Cassandra is doing internally to cause such a delay? Row caching is turned off and this is pretty much a default Cassandra 1.0 installation.

Cassandra is supposed to be able to support up to 2 billion columns per row, but at this rate of performance degradation it can't be used for very long rows in practice.

Many thanks.

Caveat: I'm hitting this with 10 requests in parallel at a time, which is why they are all a bit slower than I'd expect, but it's a fair test across all requests, and even running them in serial there is this strange degradation between the 1,800,000th and 1,900,000th column.

I've also noticed extremely bad performance when doing a reverse slice for just a single item on a row with only 200,000 columns: query.setRange(end, start, false, 1);

asked Mar 16 '12 by agentgonzo


1 Answer

A good resource on this is Aaron Morton's blog post on Cassandra's Reversed Comparators. From the article:

Recall from my post on Cassandra Query Plans that once rows get to a certain size they include an index of the columns. And that the entire index must be read whenever any part of the index needs to be used, which is the case when using a Slice Range that specifies start or reversed. So the fastest slice query to run against a row was one that retrieved the first X columns in a row by only specifying a column count.

If you are mostly reading from the end of a row (for example if you are storing things by timestamp and you mostly want to look at recent data) you can use the Reversed Comparator which stores your columns in descending order. This will give you much better (and more consistent) query performance.
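A reversed comparator is declared when the column family is created. As a sketch (assuming long column names and cassandra-cli syntax of the 1.0 era; the CF and key validation class here are illustrative, not from the question):

```
create column family events
  with key_validation_class = 'UTF8Type'
  and comparator = 'LongType(reversed=true)';
```

With this comparator the highest (newest) column names sort first, so "give me the first 20 columns by count" returns the most recent data without ever naming a start column, which is the cheap query shape described above.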

If your read patterns are more random you might be better off partitioning your data across multiple rows.
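Partitioning across multiple rows usually means folding a bucket number into the row key so that no single row's column index grows unbounded. A minimal sketch (the key scheme and bucket size are illustrative, not from the question):

```java
public class RowBucketing {
    // Cap each row at BUCKET_SIZE columns: 2 million columns
    // become 20 rows of 100,000 columns each.
    static final long BUCKET_SIZE = 100_000L;

    // Row key for the bucket that holds a given column id, e.g. "events:19".
    static String bucketKey(String baseKey, long columnId) {
        return baseKey + ":" + (columnId / BUCKET_SIZE);
    }

    public static void main(String[] args) {
        System.out.println(bucketKey("events", 42L));        // events:0
        System.out.println(bucketKey("events", 1_999_900L)); // events:19
    }
}
```

A read near the "end" of the data then touches only the last bucket's much smaller column index, and a slice spanning a bucket boundary simply becomes two row reads.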

answered Sep 30 '22 by psanford