We are currently using Cassandra (http://cassandra.apache.org/) for time series data. Cassandra is very quick on reads, but we have to perform a series of calculations on our data before presenting it: effectively we are mimicking SQL's SUM and GROUP BY functionality, something Cassandra does not support out of the box.
We are familiar with Python (to a degree) and decided to build a script that queries our Cassandra cluster, performs the math, and presents the outcome in JSON format:
import json
import time

# `cassession` is an already-connected Cassandra session.
query = (
    "SELECT query here...")

startTimeQuery = time.time()
# Execute the Cassandra query
rslt = cassession.execute(query)
print("--- %s seconds to query ---" % (time.time() - startTimeQuery))

tally = {}
startTimeCalcs = time.time()
for row in rslt:
    userid = row.site_user_id
    # Guard every column against NULLs coming back as None
    revenue = int(row.revenue or 0) - int(row.reversals_revenue or 0)
    accepted = int(row.accepted or 0)
    reversals_revenue = int(row.reversals_revenue or 0)
    error = int(row.error or 0)
    impressions_negative = int(row.impressions_negative or 0)
    impressions_positive = int(row.impressions_positive or 0)
    rejected = int(row.rejected or 0)
    reversals_rejected = int(row.reversals_rejected or 0)
    if userid in tally:  # dict.has_key() is Python 2 only; `in` works everywhere
        tally[userid]["revenue"] += revenue
        tally[userid]["accepted"] += accepted
        tally[userid]["reversals_revenue"] += reversals_revenue
        tally[userid]["error"] += error
        tally[userid]["impressions_negative"] += impressions_negative
        tally[userid]["impressions_positive"] += impressions_positive
        tally[userid]["rejected"] += rejected
        tally[userid]["reversals_rejected"] += reversals_rejected
    else:
        tally[userid] = {
            "accepted": accepted,
            "error": error,
            "impressions_negative": impressions_negative,
            "impressions_positive": impressions_positive,
            "rejected": rejected,
            "revenue": revenue,
            "reversals_rejected": reversals_rejected,
            "reversals_revenue": reversals_revenue,
        }
print("--- %s seconds to calculate results ---" % (time.time() - startTimeCalcs))

startTimeJson = time.time()
jsonOutput = json.dumps(tally)
print("--- %s seconds for json dump ---" % (time.time() - startTimeJson))
print("--- %s seconds total ---" % (time.time() - startTimeQuery))
print("Array Size: %s" % len(tally))
This is the sort of output we get:
--- 0.493520975113 seconds to query ---
--- 23.1472680569 seconds to calculate results ---
--- 0.546246051788 seconds for json dump ---
--- 24.1871240139 seconds total ---
Array Size: 198124
We are spending a large amount of time on our calculations. We know the problem isn't so much the sums and group-bys themselves as the sheer size of the result set we have to walk through.
We have heard a few good things about numpy, but the nature of our data means the matrix size is unknown up front.
We are looking for any tips on how to approach this, including a completely different programming approach.
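For reference, this is the sort of thing we imagine a numpy/pandas version might look like. It's an untested sketch, assuming the driver returns namedtuple rows (its default) with the same column names as above; as far as we can tell, pandas builds the frame from the rows themselves, so no matrix size needs to be known in advance:

import pandas as pd

# Untested sketch: pandas infers columns from the driver's namedtuple rows;
# no matrix dimensions need to be known up front.
df = pd.DataFrame(list(rslt)).fillna(0)

# Compute net revenue first, then one vectorised GROUP BY / SUM over all columns.
df["revenue"] = df["revenue"].astype(int) - df["reversals_revenue"].astype(int)
tally = df.groupby("site_user_id").sum().to_dict("index")
jsonOutput = json.dumps(tally)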
I have done a very similar piece of processing, and I was also worried about processing times. I think you are not accounting for something important: the result object you receive from Cassandra as the return value of execute() does not contain all the rows you asked for. Instead, it holds a paginated result, and it fetches rows lazily as you sweep through the object inside the for loop. This is based on personal observation, though; I don't have more technical details to offer about it.
I suggest you isolate the query from the processing of its results by adding a simple rslt = list(rslt) right after the execute() call. That forces Python to walk every row in the result before the processing starts, which in turn forces the Cassandra driver to fetch all the rows you want up front.
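Concretely, the timing section of your script would become:

startTimeQuery = time.time()
rslt = cassession.execute(query)
rslt = list(rslt)  # forces the driver to fetch every remaining page now
print("--- %s seconds to query ---" % (time.time() - startTimeQuery))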
I think you'll find that a lot of what looked like processing time was actually query time, masked by the driver's paginated result.
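If paging does turn out to be where the time goes, the DataStax Python driver also lets you tune the page size via fetch_size on a statement. A small sketch (5000 is an arbitrary example value, not a recommendation):

from cassandra.query import SimpleStatement

# Larger pages mean fewer round trips while the result set is iterated.
statement = SimpleStatement(query, fetch_size=5000)
rslt = cassession.execute(statement)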