 

Why does a pandas dataframe with sparse columns take up more memory?

I am working on a dataset with mixed sparse/dense columns. Since the sparse columns greatly outnumber the dense ones, I wanted to see whether I could store them efficiently using the sparse data structures in pandas. However, while testing the functionality I found that dataframes with sparse columns appear to take up more memory. Consider the following example:

import numpy as np
import pandas as pd

a = np.zeros(10000000)
b = np.zeros(10000000)
a[3000:3100] = 2
b[300:310] = 1

df = pd.DataFrame({'a':pd.SparseArray(a), 'b':pd.SparseArray(b)})
print(df.info())

This prints memory usage: 228.9 MB. Next:

df = pd.DataFrame({'a':a, 'b':b})
print(df.info())

This prints memory usage: 152.6 MB.

Why does the non-sparse dataframe take up less space? Am I misunderstanding something?

Installation info:

  • pandas 0.25.0
  • python 3.7.2
FChm asked Sep 11 '19

People also ask

How Pandas reduce Dataframe memory usage?

Changing numeric columns to a smaller dtype: we can downcast the data types, for example converting int64 values to int8 (when they fit) and float64 to float32. This will reduce memory usage.
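
Not from the original post, but a minimal sketch of that downcasting idea (the column names are made up); pd.to_numeric with downcast= picks the smallest dtype that can hold the values:

import numpy as np
import pandas as pd

df = pd.DataFrame({'ints': np.arange(1000, dtype=np.int64),
                   'floats': np.random.rand(1000)})
before = df.memory_usage(deep=True).sum()

# downcast='integer' / 'float' chooses the smallest dtype that fits the values
df['ints'] = pd.to_numeric(df['ints'], downcast='integer')
df['floats'] = pd.to_numeric(df['floats'], downcast='float')

print(df.dtypes)                                # int16 and float32 here
print(before, '->', df.memory_usage(deep=True).sum())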

How do I reduce Dataframe memory usage?

The first way is to change the data type of an object column in a dataframe to category when the data is categorical. This does not affect the way the dataframe looks but reduces its memory usage significantly.
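
As a small illustration (my example, not from the post), converting a low-cardinality object column to category typically looks like this:

import pandas as pd

df = pd.DataFrame({'color': ['red', 'green', 'blue'] * 100000})
as_object = df['color'].memory_usage(deep=True)

# category stores each distinct value once plus a small integer code per row
df['color'] = df['color'].astype('category')
as_category = df['color'].memory_usage(deep=True)

print(as_object, '->', as_category)  # drops by more than an order of magnitude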

Are Pandas memory intensive?

The default pandas data types are not the most memory efficient. This is especially true for text data columns with relatively few unique values (commonly referred to as “low-cardinality” data). By using more efficient data types, you can store larger datasets in memory.
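
One caveat worth a quick sketch (mine, not from the post): the default memory report understates object columns, so measure with deep=True when judging this:

import pandas as pd

df = pd.DataFrame({'text': ['some repeated value'] * 100000})

print(df.memory_usage())           # shallow: only counts the 8-byte pointers
print(df.memory_usage(deep=True))  # deep: includes the Python string objects
df.info(memory_usage='deep')       # info() accepts the same option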

How much memory can Pandas handle?

pandas holds dataframes entirely in memory, so the practical upper limit is the RAM available on the machine, not disk space. When the operating system needs memory, it will push something that isn't currently being used into a swap file for temporary storage and read it back into memory when it is needed again, which works but is far slower than RAM.
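
When a dataset does exceed what fits comfortably in RAM, one standard workaround is to stream it in chunks. A minimal sketch; the file path and column name below are hypothetical:

import pandas as pd

total = 0
# only one chunk of 1,000,000 rows is held in memory at a time
for chunk in pd.read_csv('big_file.csv', chunksize=1_000_000):
    total += chunk['value'].sum()  # 'value' is a made-up column name
print(total)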


1 Answer

I've reproduced those exact numbers. From the docs:

Pandas provides data structures for efficiently storing sparse data. These are not necessarily sparse in the typical “mostly 0”. Rather, you can view these objects as being “compressed” where any data matching a specific value (NaN / missing value, though any value can be chosen, including 0) is omitted. The compressed values are not actually stored in the array.

Which means you have to specify that it's the 0 elements that should be compressed. You can do that by using fill_value=0, like so:

df = pd.DataFrame({'a':pd.SparseArray(a, fill_value=0), 'b':pd.SparseArray(b, fill_value=0)}) 

The result of df.info() is 1.4 KB of memory usage in this case, quite a dramatic difference.
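
For completeness, a small self-contained verification of that fix (written against pandas 0.25, like the question):

import numpy as np
import pandas as pd

a = np.zeros(10000000)
b = np.zeros(10000000)
a[3000:3100] = 2
b[300:310] = 1

# with fill_value=0 only the ~110 non-zero entries are actually stored
df = pd.DataFrame({'a': pd.SparseArray(a, fill_value=0),
                   'b': pd.SparseArray(b, fill_value=0)})
print(df.memory_usage().sum())  # a few kilobytes instead of ~230 MB
print(df['a'].values.density)   # fraction of stored elements: 1e-05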

As to why it's initially bigger in your example than a normal "uncompressed" array: since nothing matched the default fill value (NaN for float data), every one of your elements was stored explicitly, as an 8-byte float64 value plus a 4-byte int32 index entry. That's 12 bytes per element instead of the dense 8, which matches your numbers exactly: 2 × 10,000,000 × 12 bytes ≈ 228.9 MB sparse versus 2 × 10,000,000 × 8 bytes ≈ 152.6 MB dense.
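
You can confirm this directly on a single column (my addition; sp.nbytes counts the stored values plus their index):

import numpy as np
import pandas as pd

a = np.zeros(10000000)
a[3000:3100] = 2

sp = pd.SparseArray(a)  # default fill_value for float data is NaN
print(sp.density)       # 1.0 -- no zeros match NaN, so every element is stored
print(sp.nbytes)        # 120000000: 80 MB of float64 values + 40 MB of int32 indices

Two such columns come to 240,000,000 bytes, i.e. the 228.9 MB that df.info() reported.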

Additional reading in the docs would tell you that 0 is the default fill_value only for arrays with an integer dtype, which yours (float64) were not.
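
A quick check of those defaults (my sketch):

import numpy as np
import pandas as pd

print(pd.SparseArray(np.zeros(5, dtype=int)).fill_value)  # 0   (integer data)
print(pd.SparseArray(np.zeros(5)).fill_value)             # nan (float data, as in the question)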

Ofer Sadan answered Oct 03 '22