Question to all Cassandra experts out there.
I have a column family with about a million records.
I would like to query these records in such a way that I should be able to perform a Not-Equal-To
kind of operation.
I Googled on this and it seems I have to use some sort of Map-Reduce
.
Can somebody tell me what are the options available in this regard.
NOT EQUALS " operator is used to search for content where the value of the specified field does not match the specified value. It cannot be used with text fields; see the DOES NOT CONTAIN (" !~ ") operator instead. Typing `field != value` is the same as typing `NOT field = value`.
Cassandra will not allow a part of a primary key to hold a null value. While Cassandra will allow you to create a secondary index on a column containing null values, it still won't allow you to query for those null values. Cassandra does not support the use of NOT or not equal to (!=) operators in the WHERE clause.
Using the SELECT command with the IN keyword. The IN keyword can define a set of clustering columns to fetch together, supporting a "multi-get" of CQL rows. A single clustering column can be defined if all preceding columns are defined for either equality or group inclusion.
null fields don't exist in Cassandra unless you add them yourself. You might be thinking of the CQL data model, which hides certain implementation details in order to have a more understandable data model. Cassandra is sparse, which means that only data that is used is actually stored.
I can suggest a few approaches.
1) If you have a limited number of values that you would like to test for not-equality, consider modeling those as a boolean
columns (i.e.: column isEqualToUnitedStates
with true or false).
2) Otherwise, consider emulating the unsupported query != X
by combining results of two separate queries, < X
and > X
on the client-side.
3) If your schema cannot support either type of query above, you may have to resort to writing custom routines that will do client-side filtering and construct the not-equal set dynamically. This will work if you can first narrow down your search space to manageable proportions, such that it's relatively cheap to run the query without the not-equal.
So let's say you're interested in all purchases of a particular customer of every product type except Widget. An ideal query could look something like SELECT * FROM purchases WHERE customer = 'Bob' AND item != 'Widget';
Now of course, you cannot run this, but in this case you should be able to run SELECT * FROM purchases WHERE customer = 'Bob'
without wasting too many resources and filter item != 'Widget'
in the client application.
4) Finally, if there is no way to restrict the data in a meaningful way before doing the scan (querying without the equality check would returning too many rows to handle comfortably), you may have to resort to MapReduce. This means running a distributed job that would scan all rows in the table across the cluster. Such jobs will obviously run a lot slower than native queries, and are quite complex to set up. If you want to go this way, please look into Cassandra Hadoop integration.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With