I'm new to Cassandra and trying out data modelling and range queries.
For learning purposes I want to build a database that stores log lines with their LogType and generation time, and answers the following query: find log lines of a given LogType within a date range.
I modelled my database as two column families:
1) Log
create column family log with comparator = 'UTF8Type'
and key_validation_class = 'LexicalUUIDType'
and column_metadata=[{column_name: block, validation_class: UTF8Type}];
where I'm planning to store log lines keyed by their log IDs, e.g.:
set log['7561a442-24e2-11df-8924-001ff3591711']['block']='someText|11-17-2011 23:40:42|sometext';
2)
create column family ltype with column_type = 'Super'
and comparator = 'TimeUUIDType'
and subcomparator = 'UTF8Type'
and column_metadata=[{column_name: id, validation_class: LexicalUUIDType}];
In this column family I will store the log type, the time, and the log line ID from the log column family, e.g.:
set ltype['ltype1'][<time-uuid>]['id']='7561a442-24e2-11df-8924-001ff3591711';
(Note that the super column name must be a valid TimeUUID, since the comparator is TimeUUIDType; a plain numeric timestamp will be rejected.)
I want to fetch results given a log type and a date range. Can someone guide me on how to run a range query on a super column family?
There are no general-purpose range queries in Cassandra.
Certain types of query in Cassandra lead to an expensive operation known as a range slice. Under some circumstances, range slices can cause high latency, long GC pauses, and node instability, so they should be identified and their impact minimised.
Using CQL, you can create a secondary index on a column after defining a table (you can also index a collection column). Secondary indexes let you query a table by a column that is not otherwise queryable.
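As a sketch of that approach (the table, column, and index names here are hypothetical, assuming CQL 3 syntax):

```sql
-- Hypothetical CQL 3 schema for the log lines
CREATE TABLE log (
  id uuid PRIMARY KEY,
  logtype text,
  logged_at timestamp,
  block text
);

-- Secondary index added after the table definition
CREATE INDEX log_logtype_idx ON log (logtype);

-- The indexed column can now be used in an equality predicate
SELECT * FROM log WHERE logtype = 'ltype1';
```

Note that a secondary index only supports equality predicates on the indexed column, so on its own it does not answer the date-range part of the question.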
An article on time series data modelling in Cassandra:
http://rubyscale.com/2011/basic-time-series-with-cassandra/
For time series, you really want to do larger rows - probably in the neighborhood of 10k-50k columns per row as a starting point (depending on your load). You can avoid super columns completely if you make the key a function of a "date bucket":
[datetime]_[5 second interval] (granularity again depending on load)
This way your keys can be re-created, and you are just issuing a multi_get with the keys for the buckets you want.
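To make the bucketing concrete, here is a minimal Python sketch of that key scheme (the function names, the "ltype1" label, and the exact key format are illustrative, not prescribed by the article):

```python
from datetime import datetime, timedelta

BUCKET_SECONDS = 5  # granularity; tune to your write load


def bucket_key(log_type, ts):
    """Build a row key like 'ltype1_2011-11-17T23:40:40' by snapping
    the timestamp down to the start of its 5-second bucket."""
    snapped = ts.replace(second=ts.second - ts.second % BUCKET_SECONDS,
                         microsecond=0)
    return "%s_%s" % (log_type, snapped.isoformat())


def bucket_keys_for_range(log_type, start, end):
    """Re-create every bucket key covering [start, end] -- these are
    the keys you would hand to a multi_get for the range query."""
    keys = []
    t = start - timedelta(seconds=start.second % BUCKET_SECONDS,
                          microseconds=start.microsecond)
    while t <= end:
        keys.append(bucket_key(log_type, t))
        t += timedelta(seconds=BUCKET_SECONDS)
    return keys
```

Because the keys are deterministic, the read side never needs a range scan: it computes the bucket keys for the requested window and fetches exactly those rows.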
A more general overview of data modeling:
http://www.datastax.com/docs/0.8/ddl/index