According to this issue, Cassandra's storage format was updated in 3.0.
Previously, I could use cassandra-cli to see how an SSTable is laid out and get something like this:
[default@test] list phonelists;
-------------------
RowKey: scott
=> (column=, value=, timestamp=1374684062860000)
=> (column=phonenumbers:bill, value='555-7382', timestamp=1374684062860000)
=> (column=phonenumbers:jane, value='555-8743', timestamp=1374684062860000)
=> (column=phonenumbers:patricia, value='555-4326', timestamp=1374684062860000)
-------------------
RowKey: john
=> (column=, value=, timestamp=1374683971220000)
=> (column=phonenumbers:doug, value='555-1579', timestamp=1374683971220000)
=> (column=phonenumbers:patricia, value='555-4326', timestamp=137468397122
What would the internal format look like in the latest version of Cassandra? Could you provide an example?
What utility can I use to see the internal representation of a table in Cassandra, as listed above, but for the new SSTable format?
All that I have found on the internet is that the partition header now stores column names, each row stores its clustering values, and values are no longer duplicated.
How can I look into it?
Prior to 3.0, sstable2json was a useful utility for understanding how data is organized in SSTables. This feature is not currently present in Cassandra 3.0, but there will be an alternative eventually. Until then, Chris Lohfink and I have developed an alternative to sstable2json (sstable-tools) for Cassandra 3.0, which you can use to understand how data is organized. There is some talk about bringing this into Cassandra proper in CASSANDRA-7464.
A key difference between the storage format of older versions of Cassandra and that of Cassandra 3.0 is that an SSTable was previously a representation of partitions and their cells (identified by their clustering and column name), whereas in Cassandra 3.0 an SSTable represents partitions and their rows.
You can read about these changes in more detail in this blog post by the primary developer of these changes, who does a great job of explaining them.
The largest benefit you will see is that in the general case your data size will shrink (in some cases by a large factor), as a lot of the overhead introduced by CQL has been eliminated by some key enhancements.
Here's an example showing the difference between C* 2 and 3.
Schema:
create keyspace demo with replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
use demo;
create table phonelists (user text, person text, phonenumbers text, primary key (user, person));
insert into phonelists (user, person, phonenumbers) values ('scott', 'bill', '555-7382');
insert into phonelists (user, person, phonenumbers) values ('scott', 'jane', '555-8743');
insert into phonelists (user, person, phonenumbers) values ('scott', 'patricia', '555-4326');
insert into phonelists (user, person, phonenumbers) values ('john', 'doug', '555-1579');
insert into phonelists (user, person, phonenumbers) values ('john', 'patricia', '555-4326');
sstable2json C* 2.2 output:
[
{"key": "scott",
"cells": [["bill:","",1451767903101827],
["bill:phonenumbers","555-7382",1451767903101827],
["jane:","",1451767911293116],
["jane:phonenumbers","555-8743",1451767911293116],
["patricia:","",1451767920541450],
["patricia:phonenumbers","555-4326",1451767920541450]]},
{"key": "john",
"cells": [["doug:","",1451767936220932],
["doug:phonenumbers","555-1579",1451767936220932],
["patricia:","",1451767945748889],
["patricia:phonenumbers","555-4326",1451767945748889]]}
]
sstable-tools toJson C* 3.0 output:
[
{
"partition" : {
"key" : [ "scott" ]
},
"rows" : [
{
"type" : "row",
"clustering" : [ "bill" ],
"liveness_info" : { "tstamp" : 1451768259775428 },
"cells" : [
{ "name" : "phonenumbers", "value" : "555-7382" }
]
},
{
"type" : "row",
"clustering" : [ "jane" ],
"liveness_info" : { "tstamp" : 1451768259793653 },
"cells" : [
{ "name" : "phonenumbers", "value" : "555-8743" }
]
},
{
"type" : "row",
"clustering" : [ "patricia" ],
"liveness_info" : { "tstamp" : 1451768259796202 },
"cells" : [
{ "name" : "phonenumbers", "value" : "555-4326" }
]
}
]
},
{
"partition" : {
"key" : [ "john" ]
},
"rows" : [
{
"type" : "row",
"clustering" : [ "doug" ],
"liveness_info" : { "tstamp" : 1451768259798802 },
"cells" : [
{ "name" : "phonenumbers", "value" : "555-1579" }
]
},
{
"type" : "row",
"clustering" : [ "patricia" ],
"liveness_info" : { "tstamp" : 1451768259908016 },
"cells" : [
{ "name" : "phonenumbers", "value" : "555-4326" }
]
}
]
}
]
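The output is larger, but that is more a consequence of the tool than of the storage format. The key differences you can see are that data is now a collection of partitions and their rows (each holding its cells) rather than partitions and their cells, that the timestamp appears once per row in liveness_info instead of on every cell, and that the clustering value (here, person) appears once per row instead of as a prefix on every cell name.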
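To make the relationship between the two outputs concrete, here is a rough Python sketch that regroups the 2.2-style cells into the 3.0-style row shape. It is only an illustration of the logical layout, not how Cassandra or sstable-tools actually converts anything. The `regroup` function is a hypothetical helper, and it assumes a single clustering column, clustering values that contain no colon, and that every cell in a row shares the timestamp of the row's first cell.

```python
import json

# Cells copied from the sstable2json (C* 2.2) output above.
legacy = [
    {"key": "scott",
     "cells": [["bill:", "", 1451767903101827],
               ["bill:phonenumbers", "555-7382", 1451767903101827],
               ["jane:", "", 1451767911293116],
               ["jane:phonenumbers", "555-8743", 1451767911293116],
               ["patricia:", "", 1451767920541450],
               ["patricia:phonenumbers", "555-4326", 1451767920541450]]},
    {"key": "john",
     "cells": [["doug:", "", 1451767936220932],
               ["doug:phonenumbers", "555-1579", 1451767936220932],
               ["patricia:", "", 1451767945748889],
               ["patricia:phonenumbers", "555-4326", 1451767945748889]]},
]


def regroup(partitions):
    """Group 2.x-style cells into 3.0-style rows (hypothetical helper)."""
    result = []
    for part in partitions:
        rows = {}  # dicts keep insertion order, so rows stay in cell order
        for name, value, tstamp in part["cells"]:
            # Assumption: the cell name is "<clustering>:<column>".
            clustering, _, column = name.partition(":")
            if clustering not in rows:
                rows[clustering] = {
                    "type": "row",
                    "clustering": [clustering],
                    # Assumption: the first cell's timestamp stands for the row.
                    "liveness_info": {"tstamp": tstamp},
                    "cells": [],
                }
            if column:  # an empty column name is the 2.x row marker cell
                rows[clustering]["cells"].append({"name": column, "value": value})
        result.append({"partition": {"key": [part["key"]]},
                       "rows": list(rows.values())})
    return result


regrouped = regroup(legacy)
print(json.dumps(regrouped, indent=2))
```

The clustering value and timestamp that were repeated on every cell end up stored once per row, and the row marker cell disappears because the row itself now carries that information.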
I should note that in this particular example the benefits of the new storage engine aren't fully realized, since there is only one non-clustering column.
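As a back-of-the-envelope illustration of why more non-clustering columns make the difference more visible, the sketch below counts how many times a clustering value and a timestamp would appear in each layout, extrapolating from the two outputs above. The `stored_copies` function is a hypothetical helper. It counts logical repetitions as they show up in the tool output, not bytes on disk, and it assumes every cell in a row was written with the same timestamp.

```python
def stored_copies(rows, columns):
    """Count repeated clustering values and timestamps per layout.

    rows    -- number of CQL rows in the partition
    columns -- number of non-clustering columns per row
    """
    # 2.x layout as seen above: one row marker cell plus one cell per
    # non-clustering column, each carrying the clustering prefix and a timestamp.
    legacy_cells = rows * (1 + columns)
    return {
        "legacy": {"clustering_copies": legacy_cells, "timestamps": legacy_cells},
        # 3.0 layout as seen above: clustering and liveness_info once per row
        # (assumption: all cells in a row share the row's timestamp).
        "v3": {"clustering_copies": rows, "timestamps": rows},
    }


# The 'scott' partition from the example: 3 rows, 1 non-clustering column.
example = stored_copies(rows=3, columns=1)
# The same 3 rows if the table had 10 non-clustering columns.
wider = stored_copies(rows=3, columns=10)
print(example)
print(wider)
```

For the example partition this gives 6 copies in the old layout against 3 in the new one, which matches the six cells and three rows shown above. With ten non-clustering columns the old layout would repeat them 33 times while the new one still stores 3.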
There are a number of other improvements not shown here (like the ability to store row-level range tombstones).