Is there a simple way/module to do a group-by operation in Python for datasets that are too large to fit in memory?
I would usually use pandas, but it breaks down for large datasets.
In a comment, @steboc mentioned using sqlite as a possible solution. You could use any database as a backend for this, but sqlite is reasonably fast and requires almost zero setup. Here's an example of writing a bunch of junk to sqlite then reading it back in groups:
Start with loading a few packages and setting up the environment:
import pandas as pd
import numpy as np
import sqlite3
import string

## connect to a db. This db will be created if it does not exist
conn = sqlite3.connect('example.db')
c = conn.cursor()

np.random.seed(123)

## size of each chunk of random data: 1mm rows, 10 numeric columns
n = 1000000
n_cols = 10
Let's loop 30 times, and each time we'll create a dataframe with 1mm records, 10 numeric fields, and a key that's simply a letter. We'll push that dataframe into the sqlite database. At the end of the loop we'll have 30mm rows in our database. This takes ~15 minutes on my MBP:
%%time
for i in range(30):
    df = pd.DataFrame(np.random.randn(n, n_cols), columns=list(map(chr, range(65, 65 + n_cols))))
    df['key'] = string.ascii_letters[i]
    df.to_sql(name='test_table', if_exists='append', con=conn)
Now if we want to do an operation on all this data, grouped by the values in the field key, we first need to get all the unique values of key. One way is to do this:
%%time
keys_df = pd.read_sql(sql='SELECT DISTINCT key FROM test_table', con=conn)
keys_df
Now we have keys_df, a dataframe with one column containing all the unique values of key. We can iterate through each group, extract only that group from the database, and do our grouped operation. The example here does a trivial describe():
%%time
for key in keys_df['key']:
    tempdf = pd.read_sql(sql='SELECT * FROM test_table WHERE key = ?', con=conn, params=(key,))
    ## do anything you want with your group here.
    ## we'll print describe just for fun
    print(tempdf.describe())
    print("")
Obviously in real life you'd put the values in a data structure.
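For instance, here's a minimal sketch (reusing keys_df and conn from above; the results name is just illustrative) that collects each group's summary into a dict and then stacks the summaries into a single dataframe:

## collect each group's describe() output instead of printing it
results = {}
for key in keys_df['key']:
    tempdf = pd.read_sql(sql='SELECT * FROM test_table WHERE key = ?', con=conn, params=(key,))
    results[key] = tempdf.describe()

## stack the per-group summaries into one dataframe, keyed by the group value
summary = pd.concat(results)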
I hope this helps illustrate how to use sqlite and pandas to iterate over groups of data.
The Blaze project is happy to manage large datasets by chunking the dataset and then using Pandas on each chunk. The Blaze out-of-core docs might interest you. And here is an explicit example on the NYC Taxi dataset.
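If you just want the chunk-and-combine idea without another dependency, here is a minimal sketch in plain pandas (not Blaze itself; the chunk size and the choice of column A are just illustrative): read the table in fixed-size chunks, compute partial aggregates per key, and combine them at the end.

## chunked group-by: partial sums and counts per chunk, combined at the end
partials = []
for chunk in pd.read_sql(sql='SELECT key, A FROM test_table', con=conn, chunksize=1000000):
    partials.append(chunk.groupby('key')['A'].agg(['sum', 'count']))

## add up the partial aggregates and compute the global mean of A per key
combined = pd.concat(partials).groupby(level=0).sum()
print(combined['sum'] / combined['count'])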
If you prefer a slower, pure-Python solution, then the toolz project might interest you. Here is its documentation on streaming analytics.
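For example, here is a minimal streaming sketch using toolz.reduceby to compute a per-key mean of column A without ever holding a whole group in memory (it assumes the example.db table built above; the query and column choice are just illustrative):

import sqlite3
from toolz import reduceby

conn = sqlite3.connect('example.db')
## the sqlite cursor yields rows lazily, so nothing is materialised up front
rows = conn.execute('SELECT key, A FROM test_table')

## fold each row into a per-key running (sum, count)
def add_row(acc, row):
    return (acc[0] + row[1], acc[1] + 1)

totals = reduceby(lambda row: row[0], add_row, rows, (0.0, 0))
print({k: s / n for k, (s, n) in totals.items()})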