
Key validation class type in Cassandra: UTF8 or LongType?

Using Cassandra, I want to store more than 20 million row keys in a column family.

My questions are:

  1. Is there a REAL performance difference between Long and UTF8 row keys?

  2. Are there any problems with row key storage size?

My row keys look like this:

rowKey=>112512462152451
rowKey=>135431354354343
rowKey=>145646546546463
rowKey=>154354354354354
rowKey=>156454343435435
rowKey=>154435435435745
Gabber asked Dec 25 '22 14:12



2 Answers

  1. Cassandra stores all data on disk (including row key values) as byte arrays. In terms of performance, the datatype of the row key really doesn't matter. The only place it does matter is that the type validator/comparator of the row key affects the on-disk sort order. So in your case, a Long will sort numerically, while a UTF8 key will sort "ASCII-betically".

  2. I can't find an exact source on this, but I recall reading that the maximum size of a row key is 64K (and you appear to be way under that). Key caching is enabled by default and will cache 200,000 keys unless otherwise specified. Whether caching 200,000 keys at any given time is enough depends on the requirements of your application. You can increase that number based on the amount of available RAM, but you should test the change in small increments.
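To make the size and ordering points concrete, here is a minimal sketch. It assumes a Long key is serialized as an 8-byte big-endian integer and a UTF8 key as the UTF-8 bytes of its decimal digits; the helper names and the `mixed` sample list are made up for illustration. It only compares the two encodings and does not model how a particular cluster lays out rows.

```python
import struct

# Sample keys taken from the question.
keys = [112512462152451, 135431354354343, 145646546546463]

def as_long_bytes(key: int) -> bytes:
    # Assumption: a 64-bit long is written as 8 bytes, big-endian.
    return struct.pack(">q", key)

def as_utf8_bytes(key: int) -> bytes:
    # The decimal digits of the key encoded as UTF-8 text.
    return str(key).encode("utf-8")

long_size = len(as_long_bytes(keys[0]))  # bytes per key as a long
utf8_size = len(as_utf8_bytes(keys[0]))  # bytes per key as text

# The two orderings only diverge when keys have different digit counts.
mixed = [9, 10, 200, 1000]  # hypothetical keys of varying length
numeric_order = sorted(mixed)
text_order = sorted(mixed, key=as_utf8_bytes)

print(long_size, utf8_size)
print(numeric_order)
print(text_order)
```

For the 15-digit keys in the question, the text form takes 15 bytes against 8 for the long. Because all the sample keys have the same number of digits, the numeric and byte-wise text orderings happen to agree for them.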

Check the DataStax docs for instructions on how to tune the row and key cache properties.

eBay also posted a good article on Cassandra data modeling that discusses proper row key selection and creation, which might help you as well.

Aaron answered Jan 05 '23 01:01



  1. No.
  2. Generally, you don't want your row keys to be too big. Large keys make the index files on disk grow until they no longer fit in memory, so when a key is not cached you end up going to disk for the key lookup as well. How big is too big really depends on your hardware resources.
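As a rough back-of-the-envelope check, the sketch below multiplies key size by the 20 million rows from the question. It counts only the raw key bytes and ignores any per-entry overhead in the index files, so treat the numbers as a lower bound; the function name is hypothetical.

```python
ROW_COUNT = 20_000_000  # row count mentioned in the question

def raw_key_bytes(key_size: int, rows: int = ROW_COUNT) -> int:
    # Bytes taken by the key values alone, excluding index overhead.
    return key_size * rows

def to_mb(num_bytes: int) -> float:
    # Decimal megabytes, for readability.
    return num_bytes / 1_000_000

long_total = raw_key_bytes(8)    # 8-byte long keys
utf8_total = raw_key_bytes(15)   # 15-character text keys
limit_total = raw_key_bytes(32)  # keys at a 32-byte size

for label, total in [("long", long_total), ("utf8", utf8_total), ("32-byte", limit_total)]:
    print(f"{label}: {to_mb(total):.0f} MB")
```

At this row count the difference between the two encodings is on the order of 140 MB of raw key data, which may or may not matter depending on available memory.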

In Cassandra 1.1 there was a problem where this code:

https://git-wip-us.apache.org/repos/asf?p=cassandra.git;a=blob;f=src/java/org/apache/cassandra/service/CacheService.java;hb=02672936#l102

used a constant value of 48 bytes as the average key cache entry size when estimating the amount of memory used by the key cache. With long keys, this logic caused the key cache to use more heap than was configured in cassandra.yaml. This was fixed in Cassandra 1.2.
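A simplified model of that effect, not the actual CacheService logic: if the cache admits entries as though each one costs 48 bytes, the real heap usage scales with the true entry size. The 100 MiB budget and the 96-byte real entry size below are hypothetical values chosen only to show the arithmetic.

```python
ASSUMED_ENTRY_BYTES = 48  # constant average cited for Cassandra 1.1

def estimated_capacity(cache_budget_bytes: int) -> int:
    # Entries admitted if every entry is assumed to cost 48 bytes.
    return cache_budget_bytes // ASSUMED_ENTRY_BYTES

def actual_usage(entries: int, real_entry_bytes: int) -> int:
    # Heap actually consumed when entries are larger than assumed.
    return entries * real_entry_bytes

budget = 100 * 1024 * 1024             # hypothetical configured budget (100 MiB)
entries = estimated_capacity(budget)   # how many entries the estimate allows
used = actual_usage(entries, 96)       # hypothetical real entry size of 96 bytes
overshoot = used / budget              # ratio of real usage to configured budget

print(entries, used, round(overshoot, 2))
```

In this model, entries that are twice the assumed size lead to roughly twice the configured heap usage.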

I usually advise my devs not to use keys longer than 32 bytes if they can avoid it.

Arya answered Jan 05 '23 03:01
