
Cassandra: Memory/Encoding Footprint of Keys (Hash/byte[] => Hex => UTF-16 => byte[])

I am trying to understand the implications of using an MD5 Hash as Cassandra Key, in terms of "memory/storage consumption":

  1. The MD5 hash of my content (in Java) is a byte[] that is 16 bytes long. (The 16 bytes comes from Wikipedia for generic MD5; I am not sure whether the Java implementation also returns 16 bytes.)
  2. Hex-encode this value to be able to print it in human-readable format => 1 byte becomes 2 hex digits.
  3. I have to represent every hex digit as a character in Java => the result is two string characters per byte (for example, "FF" is a string of length 2).
  4. Java uses UTF-16 => every string character is encoded with two bytes. "FF" would require 2x2 bytes?
  5. Conclusion => The MD5 hash in byte form is 16 bytes, but represented as a Java hex UTF-16 string it consumes 16x2x2 = 64 bytes (in memory)!? Is this correct?
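The arithmetic in the steps above can be checked with a small sketch that uses only the JDK. The class and method names (`Md5KeyFootprint`, `md5`, `toHex`, `utf16Size`) are made up for illustration, and the input string is arbitrary. It counts only the character payload: object header overhead of the `String` is ignored, and newer JVMs (JDK 9 and later, with compact strings) may internally store a pure-ASCII string with one byte per character, so the 64-byte figure is the UTF-16 encoded size rather than a guaranteed heap size.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Md5KeyFootprint {

    // MD5 produces a 128-bit digest, so the returned array is 16 bytes long.
    public static byte[] md5(String content) {
        try {
            return MessageDigest.getInstance("MD5")
                    .digest(content.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    // Each byte becomes two lowercase hex characters.
    public static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(Character.forDigit((b >> 4) & 0xF, 16));
            sb.append(Character.forDigit(b & 0xF, 16));
        }
        return sb.toString();
    }

    // Size of the string's characters when encoded as UTF-16 without a byte-order mark.
    public static int utf16Size(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        byte[] digest = md5("some content");
        String hex = toHex(digest);
        System.out.println("raw digest bytes:  " + digest.length);
        System.out.println("hex characters:    " + hex.length());
        System.out.println("UTF-16 byte count: " + utf16Size(hex));
    }
}
```

For any input this should report 16 raw bytes, 32 hex characters, and 64 bytes as UTF-16, which matches the 16x2x2 calculation.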

What is the storage consumption in Cassandra when using this as a row key?

If I had used the byte array from the hash function directly, I assume it would consume 16 bytes in Cassandra?

But if I use the hex string representation (as noted above), can Cassandra "compress" it to 16 bytes, or will it also take 64 bytes in Cassandra? I assume 64 bytes in Cassandra; is this correct?

What kind of keys do you use? Do you use the output of a hash function directly, or do you first encode it into a hex string and then use the string? (In MySQL, whenever I used a hash key, I always used its hex string representation, so it is directly readable in the MySQL tools and in the whole application. But I now realize it wastes storage?)

Maybe my thinking is completely incorrect; in that case it would be kind of you to explain where I am wrong.

Thanks very much! jens

jens asked Nov 05 '22 22:11


1 Answer

Correct on both counts: the byte[] would be 16 bytes, and the hex string as UTF-16 would be 64.

As of 0.8, Cassandra has key metadata, so you can tell it "this key is a byte[]" and the CLI will display it in hex.
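To put numbers on the options, here is a minimal JDK-only sketch that builds the key bytes three ways and renders a raw key as hex for display only. The class and method names (`KeyEncodings`, `rawKey`, `hexKeyUtf16`, `hexKeyUtf8`, `display`) and the sample digest are hypothetical, and it makes no Cassandra calls. Which encoding a given client library applies to a String key is an assumption here: UTF-16 gives the 64 bytes discussed above, while a client that serializes strings as UTF-8 would send 32.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class KeyEncodings {

    // Raw form: the 16 digest bytes, untouched.
    public static ByteBuffer rawKey(byte[] digest) {
        return ByteBuffer.wrap(digest);
    }

    // Hex text encoded as UTF-16 (no byte-order mark): 2 bytes per hex character.
    public static ByteBuffer hexKeyUtf16(String hex) {
        return ByteBuffer.wrap(hex.getBytes(StandardCharsets.UTF_16BE));
    }

    // Hex text encoded as UTF-8: 1 byte per hex character.
    public static ByteBuffer hexKeyUtf8(String hex) {
        return ByteBuffer.wrap(hex.getBytes(StandardCharsets.UTF_8));
    }

    // Render a raw key as hex for humans; the stored key stays binary.
    public static String display(ByteBuffer key) {
        ByteBuffer copy = key.duplicate(); // leave the caller's position untouched
        StringBuilder sb = new StringBuilder(copy.remaining() * 2);
        while (copy.hasRemaining()) {
            sb.append(String.format("%02x", copy.get() & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Sample 16-byte value standing in for an MD5 digest: 00 11 22 ... ff.
        byte[] digest = new byte[16];
        for (int i = 0; i < digest.length; i++) {
            digest[i] = (byte) (i * 17);
        }
        ByteBuffer raw = rawKey(digest);
        String hex = display(raw);

        System.out.println("raw key bytes:      " + raw.remaining());
        System.out.println("hex as UTF-16:      " + hexKeyUtf16(hex).remaining());
        System.out.println("hex as UTF-8:       " + hexKeyUtf8(hex).remaining());
        System.out.println("displayed as hex:   " + hex);
    }
}
```

Keeping the key binary and converting to hex only at the display boundary keeps the stored size at 16 bytes while still giving a readable form in tools and logs.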

jbellis answered Nov 12 '22 10:11