How to limit the size of pandas queries on HDF5 so it doesn't go over RAM limit?

Let's say I have a pandas DataFrame:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10, 2),
                  columns=['Column1', 'Column2'])

df

   Column1    Column2
0  0.189086 -0.093137
1  0.621479  1.551653
2  1.631438 -1.635403
3  0.473935  1.941249
4  1.904851 -0.195161
5  0.236945 -0.288274
6 -0.473348  0.403882
7  0.953940  1.718043
8 -0.289416  0.790983
9 -0.884789 -1.584088
........

An example of a query is df.query('Column1 > Column2')

Let's say you wanted to limit the size of this query's result, so the returned object wasn't so large. Is there a "pandas" way to accomplish this?

My question is primarily about querying an HDF5 object with pandas. An HDF5 object can be far larger than RAM, and therefore a query result could also be larger than RAM.

# file1.h5 contains only one field_table/key/HDF5 group called 'df'
store = pd.HDFStore('file1.h5')

# the following query could be too large 
df = store.select('df',columns=['column1', 'column2'], where=['column1==5'])

Is there a pandas/Pythonic way to stop users from executing queries whose results surpass a certain size?

ShanZhengYang asked Oct 11 '16 21:10

1 Answer

Here is a small demonstration of how to use the chunksize parameter when calling HDFStore.select():

for chunk in store.select('df', columns=['column1', 'column2'],
                          where='column1 == 5', chunksize=10**6):
    # process each `chunk` DataFrame (at most 10**6 rows at a time) here
    print(chunk.shape)
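Beyond chunking, HDFStore.select also accepts start and stop row offsets, which hard-cap how many table rows are even scanned, and a chunked loop can simply break once a row budget is reached. Here is a hedged sketch of both ideas; the demo table it writes, and the max_rows budget, are made up for illustration (the question's real file1.h5 would already exist):

```python
import numpy as np
import pandas as pd

# Build a small demo table so the sketch is self-contained.
# data_columns=True makes 'column1' queryable in where-clauses.
demo = pd.DataFrame({'column1': np.repeat([5, 7], 50),
                     'column2': np.arange(100)})
demo.to_hdf('file1.h5', key='df', format='table', data_columns=True)

store = pd.HDFStore('file1.h5')

# Hard cap: scan at most the first `stop` rows of the table,
# regardless of how many rows match the where-clause.
df_capped = store.select('df', columns=['column2'],
                         where='column1 == 5', stop=30)

# Early exit: accumulate chunks only until a row budget is reached.
max_rows = 40  # hypothetical budget for the demo
pieces, n = [], 0
for chunk in store.select('df', columns=['column2'],
                          where='column1 == 5', chunksize=10):
    pieces.append(chunk)
    n += len(chunk)
    if n >= max_rows:
        break

result = pd.concat(pieces).head(max_rows)
store.close()
```

The stop= variant bounds I/O up front but may miss matches past the cutoff, while the break-out loop guarantees up to max_rows matching rows without ever holding more than one extra chunk in memory.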
MaxU - stop WAR against UA answered Oct 15 '22 05:10