 

Cassandra: Difference between TEXT (VARCHAR) and ASCII

I understand that text and varchar are aliases, both of which store UTF-8 strings. What about ascii, which the documentation describes as a "US-ASCII character string"? What's the difference besides encoding?

Is there any size difference? Is there a preferred choice between these two when I'm storing large strings (~500KB)?

tpoker asked Jul 10 '17 16:07



1 Answer

Regarding this answer:

If the data is a piece of text, for example a Java String, it is encoded as UTF-16 in the runtime, but when it is serialized to Cassandra with the text type, UTF-8 is used. UTF-16 uses 2 bytes for most characters and 4 bytes for characters outside the Basic Multilingual Plane, whereas UTF-8 is more space efficient for ASCII-heavy text: depending on the character it uses 1, 2, 3 or 4 bytes, and only 1 byte for US-ASCII characters.
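The byte counts above are easy to check outside Cassandra. The following is a minimal Python sketch, not Cassandra code: the helper name `encoded_sizes` and the sample characters are chosen for illustration only.

```python
# Compare how many bytes one character occupies in UTF-8 and in UTF-16.
# The samples are written as escapes: 'A', e-acute, the euro sign, and an emoji.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

def encoded_sizes(ch):
    """Return (utf8_bytes, utf16_bytes) for a single character."""
    # "utf-16-be" avoids the 2-byte byte-order mark that plain "utf-16" prepends.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-be"))

for ch in samples:
    utf8_len, utf16_len = encoded_sizes(ch)
    print(f"U+{ord(ch):04X}: UTF-8={utf8_len} bytes, UTF-16={utf16_len} bytes")
```

Only the last sample lies outside the Basic Multilingual Plane, so it is the only one that needs 4 bytes in UTF-16.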

That means there is CPU work to encode and decode such data during serialization. The content also matters: for example, 158786464563 stored as text takes 12 bytes (one per digit), which is more than the 8 bytes a numeric type such as bigint would need. That means more space is used, and more I/O as well.
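To make the 12-byte figure concrete, here is a small Python sketch that compares the number encoded as text with the same value packed as a 64-bit signed integer. The comparison with bigint is an assumption added for illustration, and the sketch only mirrors the value sizes, not Cassandra's actual on-disk layout.

```python
import struct

number = 158786464563

# Stored as text: decimal digits are US-ASCII, so UTF-8 uses one byte per digit.
as_text = str(number).encode("utf-8")

# Stored as a 64-bit signed integer (the width of Cassandra's bigint type).
as_bigint = struct.pack(">q", number)

print(len(as_text), len(as_bigint))
```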

Note that Cassandra offers the ascii type, which follows the US-ASCII character set and always uses 1 byte per character.
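The flip side of the 1-byte guarantee is that an ascii column can only hold characters in the 7-bit US-ASCII range. A rough Python sketch of that check follows; `is_us_ascii` is a hypothetical helper, and the code stands in for the restriction rather than reproducing Cassandra's own validation.

```python
def is_us_ascii(text):
    """True if every character is in the 7-bit US-ASCII range (code points 0-127)."""
    return all(ord(ch) < 128 for ch in text)

# "caf\u00e9" ends in an e-acute, which is outside US-ASCII.
for value in ["order-42", "caf\u00e9"]:
    if is_us_ascii(value):
        print(f"{value!r}: {len(value)} chars -> {len(value.encode('ascii'))} bytes")
    else:
        print(f"{value!r}: contains non-ASCII characters, needs text/varchar instead")
```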


Is there any size difference?

Yes

Is there a preferred choice between these two when I'm storing large strings (~500KB)?

Yes

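Because ascii never uses more than 1 byte per character, whereas UTF-8 needs 2 to 4 bytes for every non-ASCII character, and UTF-8 is in turn more space efficient than UTF-16 for mostly-ASCII text. Keep in mind that text consisting only of US-ASCII characters encodes to the same bytes in UTF-8, so the saving only shows up relative to data that contains non-ASCII characters, which the ascii type cannot store at all. Again, it all depends on how you serialize, encode and decode the data. For more, check out "what-is-the-advantage-of-choosing-ascii-encoding-over-utf-8".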
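For a payload of roughly the size mentioned in the question, the encoded sizes can be estimated with a short Python sketch. `size_report` is a hypothetical helper, and the numbers describe the encoded string only, not Cassandra's storage overhead or compression.

```python
def size_report(text):
    """Return encoded sizes in bytes; 'ascii' is None if the text is not pure US-ASCII."""
    try:
        ascii_size = len(text.encode("ascii"))
    except UnicodeEncodeError:
        ascii_size = None
    return {
        "ascii": ascii_size,
        "utf-8": len(text.encode("utf-8")),
        # "utf-16-be" leaves out the 2-byte byte-order mark.
        "utf-16": len(text.encode("utf-16-be")),
    }

plain = "x" * 500_000          # pure US-ASCII payload
accented = "\u00e9" * 500_000  # e-acute repeated; not valid US-ASCII

print(size_report(plain))
print(size_report(accented))
```

For the pure US-ASCII payload the ascii and UTF-8 sizes come out identical, so for such data the choice between ascii and text is more about enforcing the character set than about saving space.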

MD Ruhul Amin answered Nov 04 '22 06:11
