Current pandas version: 0.22
I have a SparseDataFrame.
A = pd.SparseDataFrame(
[['a',0,0,'b'],
[0,0,0,'c'],
[0,0,0,0],
[0,0,0,'a']])
A
   0  1  2  3
0  a  0  0  b
1  0  0  0  c
2  0  0  0  0
3  0  0  0  a
Right now, the fill values are 0. However, I'd like to change the fill values to np.nan. My first instinct was to call replace:
A.replace(0, np.nan)
But this gives
TypeError: cannot convert int to an sparseblock
Which doesn't really help me understand what I'm doing wrong.
I know I can do
A.to_dense().replace(0, np.nan).to_sparse()
But is there a better way? Or is my fundamental understanding of sparse DataFrames flawed?
tl;dr: That's definitely a bug.
But please keep reading, there is more to it than that...
All of the following work fine with pandas 0.20.3, but not with any newer version:
A.replace(0,np.nan)
A.replace({0:np.nan})
A.replace([0],[np.nan])
etc... (you get the idea).
(From now on, all the code is run with pandas 0.20.3.)
However, those (along with most of the workarounds I tried) only work because we accidentally did something wrong. You'll guess what it is right away if we do this:
A.density
1.0
This SparseDataFrame is actually dense! (density is the fraction of entries that are actually stored, i.e. that differ from the fill value, so 1.0 means nothing is being compressed at all.) We can fix this by passing default_fill_value=0:
A = pd.SparseDataFrame(
[['a',0,0,'b'],
[0,0,0,'c'],
[0,0,0,0],
[0,0,0,'a']], default_fill_value=0)
Now A.density will output 0.25 as expected (only 4 of the 16 entries differ from the fill value).
This happened because the initializer couldn't infer the dtypes of the columns. Quoting from pandas docs:
Sparse data should have the same dtype as its dense representation. Currently, float64, int64 and bool dtypes are supported. Depending on the original dtype, fill_value default changes:
- float64: np.nan
- int64: 0
- bool: False
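To see those defaults in action, here is a minimal sketch against the 0.20.x sparse API discussed here (SparseArray is the underlying storage type; the expected outputs in the comments follow the quoted docs):
import pandas as pd

pd.SparseArray([1.0, 0.0, 2.0]).fill_value      # nan   (float64)
pd.SparseArray([1, 0, 2]).fill_value            # 0     (int64)
pd.SparseArray([True, False, True]).fill_value  # False (bool)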
But the dtypes of our SparseDataFrame are:
A.dtypes
0 object
1 object
2 object
3 object
dtype: object
And that's why SparseDataFrame couldn't decide which fill value to use, and thus used the default np.nan.
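You can confirm this on the original frame, i.e. the one built without default_fill_value (A_orig here is just an illustrative name; default_fill_value is the same attribute inspected further down in this answer):
A_orig = pd.SparseDataFrame(
    [['a',0,0,'b'],
     [0,0,0,'c'],
     [0,0,0,0],
     [0,0,0,'a']])
A_orig.default_fill_value   # nan
Since the zeros in the data sit in object columns, nothing matches that nan fill value, which is why the frame came out completely dense.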
OK, so now we have a genuinely sparse SparseDataFrame. Let's try to replace some entries in it:
A.replace('a','z')
   0  1  2  3
0  z  0  0  b
1  0  0  0  c
2  0  0  0  0
3  0  0  0  z
And strangely:
A.replace(0,np.nan)
   0  1  2  3
0  a  0  0  b
1  0  0  0  c
2  0  0  0  0
3  0  0  0  a
And that, as you can see, is not correct!
From my own experiments with different versions of pandas, it seems that SparseDataFrame.replace() works only with non-fill values.
To change the fill value, you have the following options:
- Convert into a dense DataFrame, do the replacement, then convert back into a SparseDataFrame (sketched just below).
- Reconstruct a new SparseDataFrame, like Wen's answer, or by passing default_fill_value set to the new fill value.
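For completeness, the first option is just the round-trip the question already mentioned, sketched here against the 0.20.x API (where to_dense and to_sparse still exist):
A.to_dense().replace(0, np.nan).to_sparse()
While I was experimenting with the last option, something even stranger happened: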
B = pd.SparseDataFrame(A, default_fill_value=np.nan)
B.density
0.25
B.default_fill_value
nan
So far, so good. But...:
B
   0  1  2  3
0  a  0  0  b
1  0  0  0  c
2  0  0  0  0
3  0  0  0  a
That really shocked me at first. Is that even possible!?
Continuing on, I tried to see what is happening in the columns:
B[0]
0 a
1 0
2 0
3 0
Name: 0, dtype: object
BlockIndex
Block locations: array([0], dtype=int32)
Block lengths: array([1], dtype=int32)
The dtype of the column is object, but the dtype of the BlockIndex associated with it is int32, hence the strange behavior.
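If you want to poke at the internals yourself, the old sparse structures expose them directly (sp_values and sp_index are the 0.x attribute names):
B[0].sp_values   # the entries that are actually stored for column 0
B[0].sp_index    # the BlockIndex printed above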
There are a lot more "strange" things going on, but I'll stop here.
From all the above, I can say that you should avoid using SparseDataFrame until a complete rewrite of it takes place :).
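For readers on current pandas: that rewrite did eventually happen. Since pandas 1.0, SparseDataFrame is gone; sparse columns are ordinary columns backed by SparseArray, and the fill value lives in the column's SparseDtype, so changing it is an astype call. A minimal sketch:
import numpy as np
import pandas as pd

arr = pd.arrays.SparseArray([0, 0, 1, 2], fill_value=0)
# Casting to a SparseDtype with a different fill value turns the old
# fill positions (the zeros here) into the new one (NaN):
arr.astype(pd.SparseDtype(float, fill_value=np.nan))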