I am working on a dataset with mixed sparse/dense columns. Since the sparse columns greatly outnumber the dense ones, I wanted to see if I could store them efficiently using the sparse data structures in pandas. However, while testing the functionality I found that dataframes with sparse columns appear to take up more memory. Consider the following example:
import numpy as np
import pandas as pd
a = np.zeros(10000000)
b = np.zeros(10000000)
a[3000:3100] = 2
b[300:310] = 1
df = pd.DataFrame({'a':pd.SparseArray(a), 'b':pd.SparseArray(b)})
print(df.info())
This prints memory usage: 228.9 MB.
Next:
df = pd.DataFrame({'a':a, 'b':b})
print(df.info())
This prints memory usage: 152.6 MB.
Does the non-sparse dataframe really take up less space? Am I misunderstanding something?
Installation info:
I've reproduced those exact numbers. From the docs:
Pandas provides data structures for efficiently storing sparse data. These are not necessarily sparse in the typical “mostly 0”. Rather, you can view these objects as being “compressed” where any data matching a specific value (NaN / missing value, though any value can be chosen, including 0) is omitted. The compressed values are not actually stored in the array.
Which means you have to specify that it's the 0 elements that should be compressed. You can do that by using fill_value=0, like so:
df = pd.DataFrame({'a':pd.SparseArray(a, fill_value=0), 'b':pd.SparseArray(b, fill_value=0)})
The result of df.info() is 1.4 KB of memory usage in this case, quite a dramatic difference.
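If you want to verify that beyond df.info(), here is a quick sketch (assuming a recent pandas where the constructor is available as pd.arrays.SparseArray) that compares the per-column footprint directly:

import numpy as np
import pandas as pd

a = np.zeros(10_000_000)
a[3000:3100] = 2

dense = pd.DataFrame({'a': a})
sparse = pd.DataFrame({'a': pd.arrays.SparseArray(a, fill_value=0)})

# Bytes per column: the dense column stores all 10,000,000 float64 values,
# while the sparse column stores only the 100 non-zero values plus their positions.
print(dense.memory_usage(deep=True))
print(sparse.memory_usage(deep=True))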
As to why it's initially bigger in your example than a normal "uncompressed" array, my guess is that it's the sparse index bookkeeping added on top of all the original data, which is still stored in full (including the zeros, in your case) because nothing matches the default fill value. Anyway, that's just a guess.
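One way to see what is actually stored is a rough sketch using the SparseArray.sp_values attribute, which holds the values that were not compressed away:

import numpy as np
import pandas as pd

a = np.zeros(10_000_000)
a[3000:3100] = 2

# With the default fill value (NaN for float data) nothing matches, so all
# 10,000,000 values are kept, plus the sparse index on top of them.
print(len(pd.arrays.SparseArray(a).sp_values))                # 10000000
print(len(pd.arrays.SparseArray(a, fill_value=0).sp_values))  # 100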
Additional reading in the docs would tell you that 0 is the default fill_value only in arrays of data.dtype=int, which yours weren't.
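To illustrate that dtype dependence, a small sketch (again assuming pd.arrays.SparseArray):

import numpy as np
import pandas as pd

# The default fill_value depends on the dtype of the data:
# NaN for floats (your case), 0 for ints.
print(pd.arrays.SparseArray(np.zeros(5)).fill_value)             # nan
print(pd.arrays.SparseArray(np.zeros(5, dtype=int)).fill_value)  # 0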