Pandas sparse dataframe larger on disk than dense version

I find that the sparse versions of a dataframe are actually much larger when saved to disk than dense versions. What am I doing wrong?

from numpy import *
import pandas as pd

test = pd.DataFrame(ones((4, 4000)))
test.ix[:, :] = nan
test.ix[0, 0] = 47

test.to_hdf('test3', 'df')
test.to_sparse(fill_value=nan).to_hdf('test4', 'df')

test.to_pickle('test5')
test.to_sparse(fill_value=nan).to_pickle('test6')

....
ls -sh test*
200K test3   16M test4  164K test5  516K test6

Using pandas version 0.12.0.

I would ultimately like to efficiently store 10^7 by 60 arrays, with about 10% density, then pull them into Pandas dataframes and play with them.


Edit: Thanks to Jeff for answering the original question. Follow-up question: this appears to give savings only for pickling, and not when using other formats like HDF5. Is pickling my best route?

print shape(array_activity) #This is just 0s and 1s
(1020000, 60)

test = pd.DataFrame(array_activity)
test_sparse = test.to_sparse()
print test_sparse.density
0.0832333496732

test.to_hdf('1', 'df')
test_sparse.to_hdf('2', 'df')
test.to_pickle('3')
test_sparse.to_pickle('4')
!ls -sh 1 2 3 4
477M 1  544M 2  477M 3   83M 4

This is data that, as a list of indices in a Matlab .mat file, is less than 12M. I was eager to get it into an HDF5/Pytables format so that I could grab just specific indices (other files are much larger, and take much longer to load into memory), and then readily do Pandasy things to them. Perhaps I am not going about this the right way?

asked Feb 06 '14 by jeffalstott

1 Answer

You are creating a frame that has 4000 columns and only 4 rows; sparse data is handled row-wise, so reverse the dimensions.

In [2]: from numpy import *

In [3]: test = pd.DataFrame(ones((4000,4)))

In [4]: test.ix[:,:] = nan

In [5]: test.ix[0,0] = 47

In [6]: test.to_hdf('test3', 'df')

In [7]: test.to_sparse(fill_value=nan).to_hdf('test4', 'df')

In [8]: test.to_pickle('test5')

In [9]: test.to_sparse(fill_value=nan).to_pickle('test6')

In [11]: !ls -sh test3 test4 test5 test6
164K test3  148K test4  160K test5   36K test6

Follow-up: the store that you supplied was written in table format, and as a result it saved the dense version (sparse is not supported for the table format, which is very flexible and queryable; see the docs).

Furthermore, you may want to experiment with saving your file using the two different representations of the sparse format (block and integer).

So, here's a sample session:

In [1]: df = pd.read_hdf('store_compressed.h5','test')

In [2]: type(df)
Out[2]: pandas.core.frame.DataFrame

In [5]: df.to_sparse(kind='block').to_hdf('test_block.h5','test',mode='w',complib='blosc',complevel=9)

In [6]: df.to_sparse(kind='integer').to_hdf('test_integer.h5','test',mode='w',complib='blosc',complevel=9)

In [7]: df.to_hdf('test_dense_fixed.h5','test',mode='w',complib='blosc',complevel=9)

In [8]: df.to_hdf('test_dense_table.h5','test',mode='w',format='table',complib='blosc',complevel=9)

In [9]: !ls -ltr *.h5
-rwxrwxr-x 1 jreback users 57015522 Feb  6 18:19 store_compressed.h5
-rw-rw-r-- 1 jreback users 30335044 Feb  6 19:01 test_block.h5
-rw-rw-r-- 1 jreback users 28547220 Feb  6 19:02 test_integer.h5
-rw-rw-r-- 1 jreback users 44540381 Feb  6 19:02 test_dense_fixed.h5
-rw-rw-r-- 1 jreback users 57744418 Feb  6 19:03 test_dense_table.h5

IIRC there is a bug in 0.12 in that to_hdf doesn't pass all the arguments through, so you probably want to use:

with get_store('test.h5',mode='w',complib='blosc',complevel=9) as store:
    store.put('test',df)
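
Not part of the original answer, but for reference, a minimal sketch of the same write (plus a read-back) using pd.HDFStore directly, which also works in later pandas versions where get_store is no longer available; the file name 'test.h5' and key 'test' follow the snippet above, and df is your frame:

import pandas as pd

with pd.HDFStore('test.h5', mode='w', complib='blosc', complevel=9) as store:
    store.put('test', df)                  # fixed-format write, compression applied

df_back = pd.read_hdf('test.h5', 'test')   # compression is transparent on read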

These are stored basically as a collection of SparseSeries, so if the density is low and non-contiguous then the file will not be as minimal, size-wise. The pandas sparse suite deals better with a smaller number of contiguous blocks, though YMMV. scipy provides some sparse handling tools as well (see the sketch below).
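
For a 0/1 matrix like the activity data in the question, here is a minimal sketch of going through scipy.sparse instead; this is an assumption on my part rather than something from the answer, it requires scipy >= 0.19 for save_npz/load_npz, and it uses a random stand-in array in place of the real array_activity:

import numpy as np
import pandas as pd
import scipy.sparse as sp

# Hypothetical stand-in for array_activity: a ~8%-dense 0/1 matrix
array_activity = (np.random.rand(1020000, 60) < 0.083).astype(np.int8)

csr = sp.csr_matrix(array_activity)       # stores only the nonzero entries
sp.save_npz('activity_sparse.npz', csr)   # compact, compressed .npz on disk

# Round-trip back into a DataFrame when you want to do pandas-y things to it
df_activity = pd.DataFrame(sp.load_npz('activity_sparse.npz').toarray())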

Though IMHO these are pretty trivial sizes for HDF5 files anyhow; you can handle a gigantic number of rows, and file sizes into the tens and hundreds of gigabytes can easily be handled.

Furthermore, you might want to consider using the table format if this is indeed a lookup table, as you can then query it; a sketch of such a query follows.
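
A minimal sketch of pulling only specific rows out of a table-format store without reading the whole file into memory (the file name, key, and row range here are illustrative, not taken from the question; PyTables must be installed):

import numpy as np
import pandas as pd

test = pd.DataFrame(np.zeros((1000, 60)))
test.to_hdf('activity.h5', 'df', format='table')   # table format is queryable

# Select only rows 100-199; the where clause is evaluated inside PyTables
subset = pd.read_hdf('activity.h5', 'df', where='index >= 100 & index < 200')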

answered by Jeff