
Large web data sets in Python - how to deal with very big arrays?

We are currently using Cassandra (http://cassandra.apache.org/) for time series data. Cassandra is very quick on reads, but we have to perform a series of calculations on our data before we present it (effectively we are mimicking the SUM and GROUP BY functionality of SQL, something Cassandra does not support out of the box).

We are familiar with Python (to a degree) and decided to build a script to query our Cassandra cluster, perform the math, and present the outcome as JSON:

import json
import time

# `cassession` is assumed to be an already-connected Cassandra session
query = (
    "SELECT query here...")

startTimeQuery = time.time()

# Executes cassandra query
rslt = cassession.execute(query)

print("--- %s seconds to query ---" % (time.time() - startTimeQuery))

tally = {}

startTimeCalcs = time.time()
for row in rslt:
    userid = row.site_user_id

    # Guard every column against NULLs coming back as None
    revenue = int(row.revenue or 0) - int(row.reversals_revenue or 0)
    accepted = int(row.accepted or 0)
    reversals_revenue = int(row.reversals_revenue or 0)
    error = int(row.error or 0)
    impressions_negative = int(row.impressions_negative or 0)
    impressions_positive = int(row.impressions_positive or 0)
    rejected = int(row.rejected or 0)
    reversals_rejected = int(row.reversals_rejected or 0)

    if userid in tally:
        tally[userid]["revenue"] += revenue
        tally[userid]["accepted"] += accepted
        tally[userid]["reversals_revenue"] += reversals_revenue
        tally[userid]["error"] += error
        tally[userid]["impressions_negative"] += impressions_negative
        tally[userid]["impressions_positive"] += impressions_positive
        tally[userid]["rejected"] += rejected
        tally[userid]["reversals_rejected"] += reversals_rejected
    else:
        tally[userid] = {
            "accepted": accepted,
            "error": error,
            "impressions_negative": impressions_negative,
            "impressions_positive": impressions_positive,
            "rejected": rejected,
            "revenue": revenue,
            "reversals_rejected": reversals_rejected,
            "reversals_revenue": reversals_revenue
        }

print("--- %s seconds to calculate results ---" % (time.time() - startTimeCalcs))

startTimeJson = time.time()
jsonOutput = json.dumps(tally)
print("--- %s seconds for json dump ---" % (time.time() - startTimeJson))

print("--- %s seconds total ---" % (time.time() - startTimeQuery))

print("Array Size: " + str(len(tally)))

This is the sort of output we get:

--- 0.493520975113 seconds to query ---
--- 23.1472680569 seconds to calculate results ---
--- 0.546246051788 seconds for json dump ---
--- 24.1871240139 seconds total ---
Array Size: 198124

We are spending a large amount of time on our calculations. We know the issue isn't so much the sums and group-bys themselves: it's the sheer size of the array that is the problem.

We have heard a few good things about numpy, but the nature of our data makes the matrix size an unknown.
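For what it's worth, here is a minimal sketch of that direction using pandas (which is built on numpy), under the assumption that the DataStax driver's default namedtuple rows can be fed straight into a DataFrame; nothing below is tested against this exact schema:

import pandas as pd

# Rows from the DataStax driver are namedtuples by default, so pandas
# can infer the column names directly from them
df = pd.DataFrame(list(rslt)).fillna(0)

# Net revenue, mirroring the subtraction in the loop above
# (assumes the metric columns come back as numeric types)
df["revenue"] = df["revenue"] - df["reversals_revenue"]

# One vectorized GROUP BY / SUM instead of ~200k Python-level dict updates
tally = df.groupby("site_user_id").sum()
jsonOutput = tally.to_json(orient="index")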

We are looking for any tips on how to approach this, including a completely different programming approach.

asked Nov 08 '22 by joao

1 Answer

I have done a very similar piece of processing, and I was also worried about processing times. I think you are not accounting for something important: the result object you receive from Cassandra as the return value of execute() does not contain all the rows you asked for. Instead, it is a paginated result that fetches rows lazily as you sweep through it in the for loop. This is based on personal observation, though; I do not have more technical details to offer about it.

I suggest you isolate the query from the processing of results by adding a simple rslt = list(rslt) right after the execute() call. That forces Python to walk every row in the result before the processing starts, which in turn forces the Cassandra driver to fetch all the rows you want up front.
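In the question's own style, that isolation step could look like this (the extra timer is illustrative):

startTimeFetch = time.time()

# Materialize the paginated ResultSet so every page is fetched up front
rslt = list(rslt)

print("--- %s seconds to fetch all rows ---" % (time.time() - startTimeFetch))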

I think you'll find that a lot of what looked like processing time was actually query time, masked by the driver behind the paginated result.
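If fetching does turn out to dominate, the DataStax Python driver also lets you tune how many rows each page carries; the value below is only an assumed starting point to experiment with:

from cassandra.query import SimpleStatement

# Larger pages mean fewer round trips to the cluster
statement = SimpleStatement(query, fetch_size=10000)
rslt = cassession.execute(statement)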

answered Nov 14 '22 by Lucas Franceschi