
Group By operation for large dataset

Is there a simple way or module to do a group-by operation in Python for datasets too large to fit in memory?

I would usually use pandas, but it breaks down for large datasets.

asked Dec 19 '14 by Alexis Eggermont

2 Answers

In a comment, @steboc mentioned using sqlite as a possible solution. You could use any database as a backend for this, but sqlite is reasonably fast and requires almost zero setup. Here's an example of writing a bunch of junk to sqlite and then reading it back in groups:

Start with loading a few packages and setting up the environment:

import numpy as np
import pandas as pd
import sqlite3
import string


## connect to a db. This db will be created if it does not exist
conn = sqlite3.connect('example.db')

np.random.seed(123)
## dimensions of the random data: 1mm rows, 10 numeric columns
n = 1000000
ncols = 10

Let's loop 30 times; each time we'll create a dataframe with 1mm records, 10 numeric fields, and a key that's simply a letter. We'll poke that dataframe into the sqlite database. At the end of the loop we'll have 30mm rows in our database. This takes ~15 minutes on my MBP:

%%time
for i in range(30):
    df = pd.DataFrame(np.random.randn(n, ncols),
                      columns=list(map(chr, range(65, 65 + ncols))))
    df['key'] = string.ascii_letters[i]
    ## index=False keeps the dataframe index out of the table
    df.to_sql(name='test_table', if_exists='append', con=conn, index=False)

Now if we were going to do an operation on all this data, grouping by the values in the field `key`, we'd first need to get all the unique values of the key. One way is to do this:

%%time
keys_df = pd.read_sql(sql='SELECT DISTINCT key FROM test_table', con=conn)
keys_df

Now we have keys_df, a dataframe with one column containing all the unique values of `key`. We can iterate through each group, extract only that group from the database, and do our grouped operation. The example here does a trivial describe():

%%time
for _, row in keys_df.iterrows():
    ## parameterized query: safer than pasting the key into the SQL string
    tempdf = pd.read_sql(sql='SELECT * FROM test_table WHERE key = ?',
                         con=conn, params=(row['key'],))
    ## do anything you want with your group here.
    ## we'll print describe() just for fun
    print(tempdf.describe())
    print()

Obviously, in real life you'd put the values in a data structure rather than just printing them.
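
For example, here is a minimal sketch that collects one aggregate per group into a dict and combines them at the end (the mean() aggregation is just an illustrative choice):

results = {}
for _, row in keys_df.iterrows():
    tempdf = pd.read_sql(sql='SELECT * FROM test_table WHERE key = ?',
                         con=conn, params=(row['key'],))
    ## keep one aggregate per group; mean() is just an illustrative choice
    results[row['key']] = tempdf.mean(numeric_only=True)

## one row per group, one column per numeric field
summary = pd.DataFrame(results).T
print(summary)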

I hope this helps illustrate how to use sqlite and pandas to iterate over groups of data.
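
As an aside: when the grouped operation is a simple aggregate, you can also push the whole group-by into SQLite itself and have pandas read back only the small result. A sketch, assuming per-key averages of column A are all you need:

## let SQLite do the aggregation; pandas only reads the tiny result.
## AVG(A) over column A is just an illustrative aggregate.
agg = pd.read_sql(sql='SELECT key, AVG(A) AS avg_a, COUNT(*) AS n '
                      'FROM test_table GROUP BY key', con=conn)
print(agg)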

answered by JD Long


The Blaze project is happy to manage large datasets by chunking the dataset and then using Pandas on each chunk. The Blaze out-of-core docs might interest you, and there is an explicit example on the NYC Taxi dataset.
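
To give a feel for the chunking strategy, here is a minimal sketch in plain pandas (not Blaze's own API; the file name data.csv and the columns key and value are hypothetical) that aggregates one chunk at a time and then combines the partial results:

import pandas as pd

## hypothetical file and column names; the point is the chunked pattern
partials = []
for chunk in pd.read_csv('data.csv', chunksize=1000000):
    ## aggregate each chunk independently...
    partials.append(chunk.groupby('key')['value'].agg(['sum', 'count']))

## ...then combine the partial aggregates into the final answer
combined = pd.concat(partials).groupby(level=0).sum()
combined['mean'] = combined['sum'] / combined['count']
print(combined)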

If you prefer a slower, pure-Python solution, then the toolz project might interest you. Here are its docs on streaming analytics.
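
As a taste, toolz.reduceby does a streaming group-by-reduce in a single pass, holding only one accumulator per group in memory. A minimal sketch, assuming a hypothetical CSV with key and value columns:

import csv
from toolz import reduceby

## stream records from a hypothetical CSV with columns: key, value
def records():
    with open('data.csv') as f:
        yield from csv.DictReader(f)

## one pass over the data; only one running total per key is held in memory
totals = reduceby(lambda rec: rec['key'],
                  lambda acc, rec: acc + float(rec['value']),
                  records(),
                  0.0)
print(totals)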

answered by MRocklin