Two tables: `schools` and `students`. The index (or keys) in SQLite will be `id` and `time` for the `students` table, and `school` and `time` for the `schools` table. My dataset is about something different, but I think the school-student example is easier to understand.
import pandas as pd
import numpy as np
import sqlite3
df_students = pd.DataFrame(
    {'id': list(range(0, 4)) + list(range(0, 4)),
     'time': [0]*4 + [1]*4,
     'school': ['A']*2 + ['B']*2 + ['A']*2 + ['B']*2,
     'satisfaction': np.random.rand(8)})
df_students.set_index(['id', 'time'], inplace=True)
df_students
         satisfaction school
id time
0  0         0.863023      A
1  0         0.929337      A
2  0         0.705265      B
3  0         0.160457      B
0  1         0.208302      A
1  1         0.029397      A
2  1         0.266651      B
3  1         0.646079      B
df_schools = pd.DataFrame(
    {'school': ['A']*2 + ['B']*2,
     'time': [0, 1]*2,
     'mean_scores': np.random.rand(4)})
df_schools.set_index(['school', 'time'], inplace=True)
df_schools
             mean_scores
school time
A      0        0.358154
       1        0.142589
B      0        0.260951
       1        0.683727
## Send to SQLite3
conn = sqlite3.connect('schools_students.sqlite')
df_students.to_sql('students', conn)
df_schools.to_sql('schools', conn)
I have a bunch of functions that operate over `pandas` dataframes and create new columns that should then be inserted in either the `schools` or the `students` table (depending on what I'm constructing). A typical function does, in order:

1. Use `pandas` functions such as `groupby`, `apply` of custom functions, `rolling_mean`, etc. (many of them not available in SQL, or difficult to write there) to construct a new column. The return type is either `pd.Series` or `np.array`.
2. Add the new column to the appropriate table (`schools` or `students`).

These functions were written when I had a small database that fit in memory, so they are pure `pandas`.
Here's an example in pseudo-code:
def example_f(satisfaction, mean_scores):
    """Silly function that divides mean satisfaction per school by mean score"""
    # here go the pandas functions I already wrote
    mean_satisfaction = satisfaction.mean()
    return mean_satisfaction / mean_scores

satisf_div_score = example_f(satisfaction, mean_scores)
# Here push satisf_div_score to the `schools` table
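For concreteness, the "push back" step could look something like this in SQLite (an untested sketch, not part of my current code: it assumes `satisf_div_score` is a `pd.Series` indexed by `(school, time)` and that the target column doesn't exist yet; reading the whole table, adding the column, and rewriting it with `to_sql` would be another option):

import sqlite3

conn = sqlite3.connect('schools_students.sqlite')
cur = conn.cursor()
# Add the new column once (this fails if the column already exists)
cur.execute('ALTER TABLE schools ADD COLUMN satisf_div_score REAL')
# Fill it row by row, matching on the (school, time) keys
cur.executemany(
    'UPDATE schools SET satisf_div_score = ? WHERE school = ? AND time = ?',
    [(float(v), s, int(t)) for (s, t), v in satisf_div_score.items()])
conn.commit()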
Because my dataset is really large, I can no longer run these functions with everything loaded in memory. Imagine that schools are located in different districts. Originally I only had one district, so I know these functions can work with the data from each district separately.
A workflow that I think would work is (a rough sketch of this loop follows the list):

1. Query the data for district i from the database.
2. Call the `pandas` functions on the data for district i and produce new columns as `np.array` or `pd.Series`.
3. Write the values of that column for district i back to the corresponding table.
4. Repeat for i = 1 to K.
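Something like the following is what I have in mind (a hypothetical sketch only: it assumes both tables carry a `district` column, which the toy example above doesn't have, and it reuses `example_f` from before):

import sqlite3
import pandas as pd

conn = sqlite3.connect('schools_students.sqlite')
districts = [row[0] for row in conn.execute('SELECT DISTINCT district FROM schools')]

for d in districts:
    # 1. Load only district d into memory
    students_d = pd.read_sql_query(
        'SELECT * FROM students WHERE district = ?', conn,
        params=(d,), index_col=['id', 'time'])
    schools_d = pd.read_sql_query(
        'SELECT * FROM schools WHERE district = ?', conn,
        params=(d,), index_col=['school', 'time'])
    # 2. Run the existing pandas function on this chunk
    new_col = example_f(students_d['satisfaction'], schools_d['mean_scores'])
    # 3. Push new_col back to the corresponding table for district d
    #    (e.g. with executemany as in the snippet above), then continue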
Although my dataset is in SQLite (and I'd prefer it to stay that way!) I'm open to migrating it to something else if the benefits are large.
I realize there are different reasonable answers, but it would be great to hear something that has proved useful and simple for you. Thanks!
There are several approaches; pick whichever suits your particular task best:

Move all the data to a "bigger" database. Personally I prefer PostgreSQL; it handles big datasets very well. Fortunately, pandas supports SQLAlchemy, a cross-database SQL toolkit and ORM, so you can use the same queries with different databases.
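For instance, the same pandas call runs unchanged against SQLite and PostgreSQL once the engine is built (a minimal sketch; the connection strings are placeholders):

from sqlalchemy import create_engine
import pandas as pd

sqlite_engine = create_engine('sqlite:///schools_students.sqlite')
pg_engine = create_engine('postgresql://user:password@localhost:5432/database')

query = 'SELECT school, time, mean_scores FROM schools'
df_a = pd.read_sql_query(query, sqlite_engine)   # same code...
df_b = pd.read_sql_query(query, pg_engine)       # ...different backend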
Split the data into chunks and run the calculation on each chunk separately. I'll demo it with PostgreSQL, but you can use any DB.
from sqlalchemy import create_engine
import psycopg2
mydb = create_engine('postgresql://user:password@localhost:5432/database')
# Let's select some groups of data into the first dataframe;
# you could use school ids instead of my sections
df = pd.read_sql_query('''SELECT sections, count(id)
                          FROM table
                          WHERE created_at < '2016-01-01'
                          GROUP BY sections
                          ORDER BY 2 DESC
                          LIMIT 10''', con=mydb)
print(df) # don't worry about strange output - sections have type int[] and it's supported well!
     sections   count
0  [121, 227]  104583
1  [296, 227]   48905
2       [121]   43599
3  [302, 227]   29684
4  [298, 227]   26814
5  [294, 227]   24071
6  [297, 227]   23038
7  [292, 227]   22019
8  [282, 227]   20369
9  [283, 227]   19908
# Now we have some sections and can select only the data related to them
for section in df['sections']:
    df2 = pd.read_sql_query('''SELECT sections, name, created_at, updated_at, status
                               FROM table
                               WHERE created_at < '2016-01-01'
                                 AND sections = %(section)s
                               ORDER BY created_at''',
                            con=mydb, params=dict(section=section))
    print(section, df2.std())
[121, 227] status 0.478194
dtype: float64
[296, 227] status 0.544706
dtype: float64
[121] status 0.499901
dtype: float64
[302, 227] status 0.504573
dtype: float64
[298, 227] status 0.518472
dtype: float64
[294, 227] status 0.46254
dtype: float64
[297, 227] status 0.525619
dtype: float64
[292, 227] status 0.627244
dtype: float64
[282, 227] status 0.362891
dtype: float64
[283, 227] status 0.406112
dtype: float64
Of course this example is synthetic - it's quite silly to compute statistics on an article status column :) But it demonstrates how to split lots of data and process it in portions.
Use the statistics features of your particular database (PostgreSQL, Oracle, MS SQL, or whichever you like). There is excellent documentation on Window Functions in PostgreSQL. Luckily, you can perform some of the calculations in the DB and move the prefabricated data into a DataFrame as above.
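As a small illustration (a sketch using the school example's names rather than a real schema; the connection string is a placeholder), a window function can compute a per-school rolling mean inside the database and hand the finished column to pandas. Recent SQLite versions (3.25+) support window functions as well:

import pandas as pd
from sqlalchemy import create_engine

mydb = create_engine('postgresql://user:password@localhost:5432/database')

query = '''
SELECT school, time,
       AVG(mean_scores) OVER (PARTITION BY school
                              ORDER BY time
                              ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS rolling_mean_scores
FROM schools
ORDER BY school, time
'''
df = pd.read_sql_query(query, mydb)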
UPDATE: How to load information back into the database.

Fortunately, DataFrame supports the to_sql method, which makes this process easy:
from sqlalchemy import create_engine
mydb = create_engine('postgresql://user:password@localhost:5432/database')
df2.to_sql('tablename', mydb, if_exists='append', chunksize=100)
You can specify the action you need: `if_exists='append'` adds rows to the existing table. If you have a lot of rows, you can split them into chunks via `chunksize` so the database can insert them.