Current pandas version: 0.22
I have a SparseDataFrame.
A = pd.SparseDataFrame(
[['a',0,0,'b'],
[0,0,0,'c'],
[0,0,0,0],
[0,0,0,'a']])
A
   0  1  2  3
0  a  0  0  b
1  0  0  0  c
2  0  0  0  0
3  0  0  0  a
Right now, the fill values are 0. However, I'd like to change the fill values to np.nan. My first instinct was to call replace:
A.replace(0, np.nan)
But this gives
TypeError: cannot convert int to an sparseblock
Which doesn't really help me understand what I'm doing wrong.
I know I can do
A.to_dense().replace(0, np.nan).to_sparse()
But is there a better way? Or is my fundamental understanding of sparse DataFrames flawed?
tl;dr: That's definitely a bug.
But please keep reading, there is more to it than that...
All of the following work fine with pandas 0.20.3, but not with any newer version:
A.replace(0,np.nan)
A.replace({0:np.nan})
A.replace([0],[np.nan])
etc... (you get the idea).
(From now on, all the code is run with pandas 0.20.3.)
However, those (along with most of the workarounds I tried) only work because we accidentally did something wrong. You'll guess what it is right away if we do this:
A.density
1.0
This SparseDataFrame is actually dense! (density is the fraction of entries that are actually stored, i.e. that differ from the fill value, so 1.0 means nothing is being compressed at all.) We can fix this by passing default_fill_value=0:
A = pd.SparseDataFrame(
[['a',0,0,'b'],
[0,0,0,'c'],
[0,0,0,0],
[0,0,0,'a']], default_fill_value=0)
Now A.density will output 0.25 as expected (only 4 of the 16 entries differ from the fill value).
This happened because the initializer couldn't infer the dtypes of the columns. Quoting from pandas docs:
Sparse data should have the same dtype as its dense representation. Currently, float64, int64 and bool dtypes are supported. Depending on the original dtype, fill_value default changes:
- float64: np.nan
- int64: 0
- bool: False
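To see those defaults in action, here is a minimal sketch against the 0.20.x sparse API discussed here (SparseArray is the underlying storage type; the expected outputs in the comments follow the quoted docs):
import pandas as pd

pd.SparseArray([1.0, 0.0, 2.0]).fill_value      # nan   (float64)
pd.SparseArray([1, 0, 2]).fill_value            # 0     (int64)
pd.SparseArray([True, False, True]).fill_value  # False (bool)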
But the dtypes of our SparseDataFrame are:
A.dtypes
0 object
1 object
2 object
3 object
dtype: object
And that's why SparseDataFrame couldn't decide which fill value to use, and thus used the default np.nan.
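You can confirm this on the original frame, i.e. the one built without default_fill_value (A_orig here is just an illustrative name; default_fill_value is the same attribute inspected further down in this answer):
A_orig = pd.SparseDataFrame(
    [['a',0,0,'b'],
     [0,0,0,'c'],
     [0,0,0,0],
     [0,0,0,'a']])
A_orig.default_fill_value   # nan
Since the zeros in the data sit in object columns, nothing matches that nan fill value, which is why the frame came out completely dense.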
OK, so now we have a genuinely sparse SparseDataFrame. Let's try to replace some entries in it:
A.replace('a','z')
   0  1  2  3
0  z  0  0  b
1  0  0  0  c
2  0  0  0  0
3  0  0  0  z
And strangely:
A.replace(0,np.nan)
   0  1  2  3
0  a  0  0  b
1  0  0  0  c
2  0  0  0  0
3  0  0  0  a
And that, as you can see, is not correct!
From my own experiments with different versions of pandas, it seems that SparseDataFrame.replace() works only with non-fill values.
To change the fill value, you have the following options:
- Convert into a dense DataFrame, do the replacement, then convert back into a SparseDataFrame (sketched just below).
- Reconstruct a new SparseDataFrame, like Wen's answer, or by passing default_fill_value set to the new fill value.
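For completeness, the first option is just the round-trip the question already mentioned, sketched here against the 0.20.x API (where to_dense and to_sparse still exist):
A.to_dense().replace(0, np.nan).to_sparse()
While I was experimenting with the last option, something even stranger happened: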
B = pd.SparseDataFrame(A, default_fill_value=np.nan)
B.density
0.25
B.default_fill_value
nan
So far, so good. But...:
B
   0  1  2  3
0  a  0  0  b
1  0  0  0  c
2  0  0  0  0
3  0  0  0  a
That really shocked me at first. Is that even possible!?
Continuing on, I tried to see what is happening in the columns:
B[0]
0 a
1 0
2 0
3 0
Name: 0, dtype: object
BlockIndex
Block locations: array([0], dtype=int32)
Block lengths: array([1], dtype=int32)
The dtype of the column is object, but the dtype of the BlockIndex associated with it is int32, hence the strange behavior.
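If you want to poke at the internals yourself, the old sparse structures expose them directly (sp_values and sp_index are the 0.x attribute names):
B[0].sp_values   # the entries that are actually stored for column 0
B[0].sp_index    # the BlockIndex printed above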
There are a lot more "strange" things going on, but I'll stop here.
From all the above, I can say that you should avoid using SparseDataFrame until a complete rewrite of it takes place :).
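For readers on current pandas: that rewrite did eventually happen. Since pandas 1.0, SparseDataFrame is gone; sparse columns are ordinary columns backed by SparseArray, and the fill value lives in the column's SparseDtype, so changing it is an astype call. A minimal sketch:
import numpy as np
import pandas as pd

arr = pd.arrays.SparseArray([0, 0, 1, 2], fill_value=0)
# Casting to a SparseDtype with a different fill value turns the old
# fill positions (the zeros here) into the new one (NaN):
arr.astype(pd.SparseDtype(float, fill_value=np.nan))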