Pandas sparse dataframe larger on disk than dense version

I find that the sparse versions of a dataframe are actually much larger when saved to disk than dense versions. What am I doing wrong?

from numpy import *
import pandas as pd

test = pd.DataFrame(ones((4, 4000)))
test.ix[:, :] = nan
test.ix[0, 0] = 47

test.to_hdf('test3', 'df')
test.to_sparse(fill_value=nan).to_hdf('test4', 'df')

test.to_pickle('test5')
test.to_sparse(fill_value=nan).to_pickle('test6')

....
ls -sh test*
200K test3   16M test4  164K test5  516K test6

Using pandas version 0.12.0.

I would ultimately like to efficiently store 10^7 by 60 arrays, with about 10% density, then pull them into Pandas dataframes and play with them.


Edit: Thanks to Jeff for answering the original question. Follow-up question: this appears to give savings only for pickling, and not when using other formats like HDF5. Is pickling my best route?

print shape(array_activity) #This is just 0s and 1s
(1020000, 60)

test = pd.DataFrame(array_activity)
test_sparse = test.to_sparse()
print test_sparse.density
0.0832333496732

test.to_hdf('1', 'df')
test_sparse.to_hdf('2', 'df')
test.to_pickle('3')
test_sparse.to_pickle('4')
!ls -sh 1 2 3 4
477M 1  544M 2  477M 3   83M 4

This is data that, as a list of indices in a Matlab .mat file, is less than 12M. I was eager to get it into an HDF5/Pytables format so that I could grab just specific indices (other files are much larger, and take much longer to load into memory), and then readily do Pandasy things to them. Perhaps I am not going about this the right way?

asked Feb 06 '14 by jeffalstott

1 Answer

You are creating a frame that has 4000 columns and only 4 rows; sparse data is handled row-wise, so reverse the dimensions.

In [2]: from numpy import *

In [3]: test = pd.DataFrame(ones((4000,4)))

In [4]: test.ix[:,:] = nan

In [5]: test.ix[0,0] = 47

In [6]: test.to_hdf('test3', 'df')

In [7]: test.to_sparse(fill_value=nan).to_hdf('test4', 'df')

In [8]: test.to_pickle('test5')

In [9]: test.to_sparse(fill_value=nan).to_pickle('test6')

In [11]: !ls -sh test3 test4 test5 test6
164K test3  148K test4  160K test5   36K test6

Follow-up: the store that you supplied was written in table format, and as a result it saved the dense version (sparse is not supported for the table format, which is very flexible and queryable; see the docs).

Furthermore, you may want to experiment with saving your file using the two different representations of the sparse format (block and integer).

So, here's a sample session:

In [1]: df = pd.read_hdf('store_compressed.h5','test')

In [2]: type(df)
Out[2]: pandas.core.frame.DataFrame

In [5]: df.to_sparse(kind='block').to_hdf('test_block.h5','test',mode='w',complib='blosc',complevel=9)

In [6]: df.to_sparse(kind='integer').to_hdf('test_integer.h5','test',mode='w',complib='blosc',complevel=9)

In [7]: df.to_hdf('test_dense_fixed.h5','test',mode='w',complib='blosc',complevel=9)

In [8]: df.to_hdf('test_dense_table.h5','test',mode='w',format='table',complib='blosc',complevel=9)

In [9]: !ls -ltr *.h5
-rwxrwxr-x 1 jreback users 57015522 Feb  6 18:19 store_compressed.h5
-rw-rw-r-- 1 jreback users 30335044 Feb  6 19:01 test_block.h5
-rw-rw-r-- 1 jreback users 28547220 Feb  6 19:02 test_integer.h5
-rw-rw-r-- 1 jreback users 44540381 Feb  6 19:02 test_dense_fixed.h5
-rw-rw-r-- 1 jreback users 57744418 Feb  6 19:03 test_dense_table.h5

IIRC there is a bug in 0.12 in that to_hdf doesn't pass all the arguments through, so you probably want to use:

with get_store('test.h5',mode='w',complib='blosc',complevel=9) as store:
    store.put('test',df)
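
Not part of the original answer, but for reference, a minimal sketch of the same write (plus a read-back) using pd.HDFStore directly, which also works in later pandas versions where get_store is no longer available; the file name 'test.h5' and key 'test' follow the snippet above, and df is your frame:

import pandas as pd

with pd.HDFStore('test.h5', mode='w', complib='blosc', complevel=9) as store:
    store.put('test', df)                  # fixed-format write, compression applied

df_back = pd.read_hdf('test.h5', 'test')   # compression is transparent on read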

These are stored basically as a collection of SparseSeries, so if the density is low and non-contiguous then the file will not be as minimal, size-wise. The pandas sparse suite deals better with a smaller number of contiguous blocks, though YMMV. scipy provides some sparse handling tools as well (see the sketch below).
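
For a 0/1 matrix like the activity data in the question, here is a minimal sketch of going through scipy.sparse instead; this is an assumption on my part rather than something from the answer, it requires scipy >= 0.19 for save_npz/load_npz, and it uses a random stand-in array in place of the real array_activity:

import numpy as np
import pandas as pd
import scipy.sparse as sp

# Hypothetical stand-in for array_activity: a ~8%-dense 0/1 matrix
array_activity = (np.random.rand(1020000, 60) < 0.083).astype(np.int8)

csr = sp.csr_matrix(array_activity)       # stores only the nonzero entries
sp.save_npz('activity_sparse.npz', csr)   # compact, compressed .npz on disk

# Round-trip back into a DataFrame when you want to do pandas-y things to it
df_activity = pd.DataFrame(sp.load_npz('activity_sparse.npz').toarray())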

Though IMHO these are pretty trivial sizes for HDF5 files anyhow; you can handle a gigantic number of rows, and file sizes into the tens and hundreds of gigabytes can easily be handled.

Furthermore, you might want to consider using the table format if this is indeed a lookup table, as you can then query it; a sketch of such a query follows.
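
A minimal sketch of pulling only specific rows out of a table-format store without reading the whole file into memory (the file name, key, and row range here are illustrative, not taken from the question; PyTables must be installed):

import numpy as np
import pandas as pd

test = pd.DataFrame(np.zeros((1000, 60)))
test.to_hdf('activity.h5', 'df', format='table')   # table format is queryable

# Select only rows 100-199; the where clause is evaluated inside PyTables
subset = pd.read_hdf('activity.h5', 'df', where='index >= 100 & index < 200')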

answered by Jeff