I find that the sparse versions of a dataframe are actually much larger when saved to disk than dense versions. What am I doing wrong?
import pandas as pd
from numpy import ones, nan

test = pd.DataFrame(ones((4, 4000)))
test.ix[:, :] = nan
test.ix[0, 0] = 47
test.to_hdf('test3', 'df')
test.to_sparse(fill_value=nan).to_hdf('test4', 'df')
test.to_pickle('test5')
test.to_sparse(fill_value=nan).to_pickle('test6')
....
ls -sh test*
200K test3 16M test4 164K test5 516K test6
Using version 0.12.0
I would ultimately like to efficiently store 10^7 by 60 arrays, with about 10% density, then pull them into Pandas dataframes and play with them.
Edit: Thanks to Jeff for answering the original question. Follow-up question: this appears to give savings only for pickling, not when using other formats like HDF5. Is pickling my best route?
print shape(array_activity) #This is just 0s and 1s
(1020000, 60)
test = pd.DataFrame(array_activity)
test_sparse = test.to_sparse()
print test_sparse.density
0.0832333496732
test.to_hdf('1', 'df')
test_sparse.to_hdf('2', 'df')
test.to_pickle('3')
test_sparse.to_pickle('4')
!ls -sh 1 2 3 4
477M 1 544M 2 477M 3 83M 4
This is data that, as a list of indices in a Matlab .mat file, is less than 12M. I was eager to get it into an HDF5/PyTables format so that I could grab just specific indices (other files are much larger, and take much longer to load into memory), and then readily do Pandasy things to them. Perhaps I am not going about this the right way?
you are creating a frame that has 4000 columns and only 4 rows; sparse data is handled row-wise, so reverse the dimensions.
In [1]: import pandas as pd
In [2]: from numpy import *
In [3]: test = pd.DataFrame(ones((4000,4)))
In [4]: test.ix[:,:] = nan
In [5]: test.ix[0,0] = 47
In [6]: test.to_hdf('test3', 'df')
In [7]: test.to_sparse(fill_value=nan).to_hdf('test4', 'df')
In [8]: test.to_pickle('test5')
In [9]: test.to_sparse(fill_value=nan).to_pickle('test6')
In [11]: !ls -sh test3 test4 test5 test6
164K test3 148K test4 160K test5 36K test6
Follow-up: the store that you supplied was written in table format, and as a result saved the dense version (sparse is not supported for the table format, which is very flexible and queryable; see the docs).
Furthermore, you may want to experiment with saving your file using 2 different representations of the sparse format.
so, here's a sample session:
In [1]: df = pd.read_hdf('store_compressed.h5','test')
In [2]: type(df)
Out[2]: pandas.core.frame.DataFrame
In [5]: df.to_sparse(kind='block').to_hdf('test_block.h5','test',mode='w',complib='blosc',complevel=9)
In [6]: df.to_sparse(kind='integer').to_hdf('test_integer.h5','test',mode='w',complib='blosc',complevel=9)
In [7]: df.to_hdf('test_dense_fixed.h5','test',mode='w',complib='blosc',complevel=9)
In [8]: df.to_hdf('test_dense_table.h5','test',mode='w',format='table',complib='blosc',complevel=9)
In [9]: !ls -ltr *.h5
-rwxrwxr-x 1 jreback users 57015522 Feb 6 18:19 store_compressed.h5
-rw-rw-r-- 1 jreback users 30335044 Feb 6 19:01 test_block.h5
-rw-rw-r-- 1 jreback users 28547220 Feb 6 19:02 test_integer.h5
-rw-rw-r-- 1 jreback users 44540381 Feb 6 19:02 test_dense_fixed.h5
-rw-rw-r-- 1 jreback users 57744418 Feb 6 19:03 test_dense_table.h5
IIRC there is a bug in 0.12 in that to_hdf
doesn't pass all the arguments through, so you probably want to use:
with get_store('test.h5',mode='w',complib='blosc',complevel=9) as store:
    store.put('test', df)
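And a matching sketch for reading it back, assuming the same 'test.h5' path and 'test' key written above:

with get_store('test.h5') as store:
    df = store['test']   # fixed-format stores come back whole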
These are stored basically as a collection of SparseSeries, so if the density is low and non-contiguous then the file will not be as minimal, size-wise, as you might hope. The pandas sparse suite deals better with a smaller number of contiguous blocks, though YMMV. SciPy provides some sparse handling tools as well.
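If your data really is just 0s and 1s, here is a minimal sketch of keeping it in a SciPy CSR matrix instead; the shape, the ~8% fill, and the activity.npz filename are made up for illustration, and save_npz/load_npz require a reasonably recent SciPy:

import numpy as np
import pandas as pd
from scipy import sparse

# stand-in for the 0/1 activity array from the question (hypothetical shape/density)
arr = np.zeros((1000, 60))
arr[:, ::12] = 1                          # roughly 8% nonzero, purely for illustration

csr = sparse.csr_matrix(arr)              # keeps only the nonzero entries
sparse.save_npz('activity.npz', csr)      # compact on-disk representation
df = pd.DataFrame(sparse.load_npz('activity.npz').toarray())  # back to a dense DataFrame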
Though IMHO, these are pretty trivial sizes for HDF5 files anyhow; you can handle a gigantic number of rows, and file sizes into the tens and hundreds of gigabytes can easily be handled.
Furthermore, you might want to consider using the table format if this is indeed a lookup table, as then you can query it.
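A rough sketch of that (the test_table.h5 filename is made up, and the format='table' keyword plus the string where= syntax assume a pandas newer than 0.12):

df.to_hdf('test_table.h5', 'test', mode='w', format='table',
          complib='blosc', complevel=9)
subset = pd.read_hdf('test_table.h5', 'test', where='index < 1000')   # pull back only the rows you need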