
Pandas.concat on Sparse Dataframes... a mystery?

Why is the result of concatenating two sparse DataFrames itself sparse, but in a strange way? And how can I measure the memory used by the concatenated DataFrame?

Here is a code sample that reproduces the issue:

import pandas as pd

df1 = pd.DataFrame({'A': [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
              'B': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
              'C': [0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0],
              'D': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
              'E': [0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0]},
            index=['a','b','c','d','e','f','g','h','i','j','k','l']).to_sparse(fill_value=0)

df2 = pd.DataFrame({'F': [0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0],
              'G': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0],
              'H': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
              'I': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
              'J': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6]},
            index=['a','b','c','d','e','f','g','h','i','j','k','l']).to_sparse(fill_value=0)

print("df1 sparse size =", df1.memory_usage().sum(),"Bytes, density =", df1.density)
print(type(df1))
print('default_fill_value =', df1.default_fill_value)
print(df1.values)

print("df2 sparse size =", df2.memory_usage().sum(),"Bytes, density =", df2.density)
print(type(df2))
print('default_fill_value =', df2.default_fill_value)
print(df2.values)

result = pd.concat([df1,df2], axis=1)

print(type(result)) # Seems all right
print('default_fill_value =', result.default_fill_value) # Why is the default fill value not 0?
print(result.values) # What are those "nan" blocks?
# result.density # Raises an error
# result.memory_usage() # Raises an error

More generally, does anyone know what is happening here?

Jean Lescut asked Nov 08 '22 21:11



1 Answer

This is a known problem and there is an issue for it.
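
For readers on a newer pandas release, the following is a minimal sketch, not part of the original report, of how the same concatenation might be written with sparse column dtypes. It assumes pandas 1.0 or later, where `SparseDataFrame` and `to_sparse` no longer exist; the names `dense1`, `dense2` and `sparse_int` are illustrative only.

```python
import pandas as pd

index = list("abcdefghijkl")

# Same data as in the question, built as ordinary dense frames first.
dense1 = pd.DataFrame({"A": [0, 1] + [0] * 10,
                       "B": [0] * 12,
                       "C": [0] * 4 + [2] + [0] * 7,
                       "D": [0] * 12,
                       "E": [0] * 9 + [3] + [0] * 2}, index=index)
dense2 = pd.DataFrame({"F": [0] * 4 + [4] + [0] * 7,
                       "G": [0] * 10 + [5, 0],
                       "H": [0] * 12,
                       "I": [0] * 12,
                       "J": [0] * 11 + [6]}, index=index)

# Assumption: pandas >= 1.0. Sparsity is a per-column dtype here,
# with 0 as the fill value, rather than a separate SparseDataFrame class.
sparse_int = pd.SparseDtype("int64", fill_value=0)
df1 = dense1.astype(sparse_int)
df2 = dense2.astype(sparse_int)

# Both frames share the same index, so the columns are simply placed side by side.
result = pd.concat([df1, df2], axis=1)

print(result.dtypes)
print("density =", result.sparse.density)
print("sparse size =", result.memory_usage(index=False).sum(), "bytes")
print("dense size  =", result.sparse.to_dense().memory_usage(index=False).sum(), "bytes")
```

In this sketch every column keeps a `Sparse[int64, 0]` dtype after the concatenation, so `result.sparse.density` and `result.memory_usage()` are the places to look for the density and memory figures the question asks about.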

Mike Müller answered Nov 15 '22 12:11
