I've been using mySQL for an app for some time, and the more data I collect, the slower it gets. So I have been looking into NOSQL options. One of the things I have in mySQL is a View created from a bunch of joins. The app shows all the important info in a grid, and the user can select ranges, do searches, etc. On this data set. Standard Query stuff.
Looking at Cassandra everything is already sorted based on the parameters I provide in my storage-conf.xml. So I would have a certain string as my key in the SuperColumn, and keep a bunch of the data in Columns below that. But I can only sort by one Column, and I can't do any real searching within the columns without pulling all the SuperColumns, and looping through the data, right?
I don't want to duplicate data across different ColumnFamilies, so I want to make sure Cassandra is appropriate for me. In Facebook, Digg, Twitter, they have plenty of searching functions, so maybe I am just not seeing the solution.
Is there a way with Cassandra for me to search for or filter specific data values in a SuperColumn, or its associated Column(s)? If not, is there another NOSQL option?
In the example below, it seems I can only query for phatduckk, friend1,John, etc. But what if I wanted to find anyone in the ColumnFamily that lived in city == "Beverley Hills"? Can it be done without returning all records? If so, could I do a search for city == "Beverley Hills" AND state == "CA"? It doesn't seem like I can do either, but I want to make sure and see what my options are.
AddressBook = { // this is a ColumnFamily of type Super
phatduckk: { // this is the key to this row inside the Super CF
friend1: {street: "8th street", zip: "90210", city: "Beverley Hills", state: "CA"},
John: {street: "Howard street", zip: "94404", city: "FC", state: "CA"},
Kim: {street: "X street", zip: "87876", city: "Balls", state: "VA"},
Tod: {street: "Jerry street", zip: "54556", city: "Cartoon", state: "CO"},
Bob: {street: "Q Blvd", zip: "24252", city: "Nowhere", state: "MN"},
}, // end row
ieure: {
joey: {street: "A ave", zip: "55485", city: "Hell", state: "NV"},
William: {street: "Armpit Dr", zip: "93301", city: "Bakersfield", state: "CA"},
},
}
Cassandra will request ALLOW FILTERING as it will have to first find and load the rows containing Jonathan as author, and then to filter out the ones which do not have a time2 column equal to the specified value. Adding an index on time2 might improve the query performance.
The GROUP BY option can condense all selected rows that share the same values for a set of columns into a single row. Using the GROUP BY option, rows can be grouped at the partition key or clustering column level. Consequently, the GROUP BY option only accepts primary key columns in defined order as arguments.
Cassandra provides standard built-in functions that return aggregate values to SELECT statements. A SELECT expression using COUNT(column_name) returns the number of non-NULL values in a column. A SELECT expression using COUNT(*) returns the number of rows that matched the query. Use COUNT(1) to get the same result.
You "don't want to duplicate data across different ColumnFamilies," but that is how you do this kind of query in Cassandra. See http://maxgrinev.com/2010/07/12/do-you-really-need-sql-to-do-it-all-in-cassandra/
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With