Let's say I have a pandas DataFrame:
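import numpy as np
import pandas as pd
# Illustrative random data only; the row count is arbitrary
df = pd.DataFrame(np.random.randn(100, 2), columns=['Column1', 'Column2'])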
df
Column1 Column2
0 0.189086 -0.093137
1 0.621479 1.551653
2 1.631438 -1.635403
3 0.473935 1.941249
4 1.904851 -0.195161
5 0.236945 -0.288274
6 -0.473348 0.403882
7 0.953940 1.718043
8 -0.289416 0.790983
9 -0.884789 -1.584088
........
An example of a query is df.query('Column1 > Column2')
Let's say you wanted to limit the size of this query's result, so the returned object isn't too large. Is there a "pandas" way to accomplish this?
My question is primarily about querying an HDF5 object with pandas. An HDF5 object can be far larger than RAM, and therefore the result of a query can be larger than RAM too.
# file1.h5 contains only one field_table/key/HDF5 group called 'df'
store = pd.HDFStore('file1.h5')
# the following query could be too large
df = store.select('df',columns=['column1', 'column2'], where=['column1==5'])
Is there a pandas/Pythonic way to stop users from executing queries whose results surpass a certain size?
Here is a small demonstration of how to use the chunksize parameter when calling HDFStore.select():
for chunk in store.select('df', columns=['column1', 'column2'],
                          where='column1==5', chunksize=10**6):
    # process `chunk` DF (each chunk holds at most `chunksize` rows)
    pass
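Chunking keeps memory bounded while iterating, but it does not by itself stop a query from returning too much. One possible way to enforce a cap is to wrap the chunk iterator and abort once a row budget is exceeded. The following is only a sketch: limit_rows, QueryTooLargeError and max_rows are hypothetical names introduced here, not part of pandas.

```python
class QueryTooLargeError(RuntimeError):
    """Hypothetical error: a chunked query produced more rows than allowed."""


def limit_rows(chunks, max_rows):
    """Yield chunks unchanged; raise once the running row count exceeds max_rows.

    `chunks` can be any iterable of objects supporting len(), such as the
    DataFrames produced by HDFStore.select(..., chunksize=...).
    """
    seen = 0
    for chunk in chunks:
        seen += len(chunk)  # len() of a DataFrame is its number of rows
        if seen > max_rows:
            raise QueryTooLargeError(
                f"query returned more than {max_rows} rows; aborting"
            )
        yield chunk


# Hypothetical usage with the store from the question:
#   chunks = store.select('df', where='column1==5', chunksize=10**5)
#   df = pd.concat(limit_rows(chunks, max_rows=10**6))
```

Because the check runs per chunk, at most one chunk beyond the budget is ever read, so a smaller chunksize gives a tighter bound on memory at the cost of more iterations.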