
Changing the fill_values in a SparseDataFrame - replace throws TypeError

Current pandas version: 0.22


I have a SparseDataFrame.

import numpy as np
import pandas as pd

A = pd.SparseDataFrame(
    [['a',0,0,'b'],
     [0,0,0,'c'],
     [0,0,0,0],
     [0,0,0,'a']])

A

   0  1  2  3
0  a  0  0  b
1  0  0  0  c
2  0  0  0  0
3  0  0  0  a

Right now, the fill values are 0. However, I'd like to change the fill_values to np.nan. My first instinct was to call replace:

A.replace(0, np.nan)

But this gives

TypeError: cannot convert int to an sparseblock

Which doesn't really help me understand what I'm doing wrong.

I know I can do

A.to_dense().replace(0, np.nan).to_sparse()

But is there a better way? Or is my fundamental understanding of Sparse dataframes flawed?

cs95 asked Jan 09 '18 04:01




1 Answer

tl;dr: That's definitely a bug.
But please keep reading, there is more to it than that...

All of the following work fine with pandas 0.20.3, but not with any newer version:

A.replace(0,np.nan)
A.replace({0:np.nan})
A.replace([0],[np.nan])

etc... (you get the idea).

(From now on, all the code is run with pandas 0.20.3.)

However, those calls (along with most of the workarounds I tried) only work because we accidentally did something wrong. You'll guess what right away if we check the density:

A.density

1.0

A density of 1.0 means every value is stored explicitly: this SparseDataFrame is actually dense!
We can fix this by passing default_fill_value=0:

A = pd.SparseDataFrame(
    [['a',0,0,'b'],
     [0,0,0,'c'],
     [0,0,0,0],
     [0,0,0,'a']],
    default_fill_value=0)

Now A.density will output 0.25 as expected.
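To make the two density numbers concrete, here is a minimal pure-Python sketch of what density measures: the fraction of cells that differ from the fill value and therefore have to be stored. It does not use pandas at all; `density` and `is_fill` are hypothetical helpers written for illustration only, not the library's implementation.

```python
import math

rows = [
    ['a', 0, 0, 'b'],
    [0, 0, 0, 'c'],
    [0, 0, 0, 0],
    [0, 0, 0, 'a'],
]

def is_fill(value, fill_value):
    """True if value matches the fill value (NaN is treated as matching NaN)."""
    if isinstance(fill_value, float) and math.isnan(fill_value):
        return isinstance(value, float) and math.isnan(value)
    return value == fill_value

def density(rows, fill_value):
    """Fraction of cells that are NOT the fill value (hypothetical helper)."""
    cells = [value for row in rows for value in row]
    stored = [value for value in cells if not is_fill(value, fill_value)]
    return len(stored) / len(cells)

density_nan = density(rows, float('nan'))  # no cell is NaN, so all 16 are stored
density_zero = density(rows, 0)            # only the 4 strings are stored
```

With NaN as the fill value nothing can be skipped, so the result is 1.0; with 0 as the fill value only 4 of the 16 cells remain, which gives 0.25.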

This happened because the initializer couldn't infer a numeric dtype for the columns. Quoting from the pandas docs:

Sparse data should have the same dtype as its dense representation. Currently, float64, int64 and bool dtypes are supported. Depending on the original dtype, fill_value default changes:

  • float64: np.nan
  • int64: 0
  • bool: False

But the dtypes of our SparseDataFrame are:

A.dtypes

0    object
1    object
2    object
3    object
dtype: object

That's why SparseDataFrame couldn't decide which fill value to use, and fell back to the default, np.nan.
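As a rough sketch of that lookup (not pandas code): the mapping below copies the three documented defaults from the quote, and `default_fill_for` is a hypothetical helper. The NaN fallback for any other dtype is an assumption based on the behavior observed above, not something the docs state.

```python
import math

# Defaults copied from the documentation quote above.
DOCUMENTED_DEFAULTS = {"float64": float("nan"), "int64": 0, "bool": False}

def default_fill_for(dtype_name):
    """Hypothetical lookup: documented default, else NaN (assumed fallback)."""
    return DOCUMENTED_DEFAULTS.get(dtype_name, float("nan"))

int_fill = default_fill_for("int64")      # documented default for int64
bool_fill = default_fill_for("bool")      # documented default for bool
object_fill = default_fill_for("object")  # not documented: falls back to NaN
```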

OK, so now we have a SparseDataFrame that is actually sparse. Let's try to replace some entries in it:

A.replace('a','z')

    0   1   2   3
0   z   0   0   b
1   0   0   0   c
2   0   0   0   0
3   0   0   0   z

And strangely:

A.replace(0,np.nan)

    0   1   2   3
0   a   0   0   b
1   0   0   0   c
2   0   0   0   0
3   0   0   0   a

And that, as you can see, is not correct: the zeros are still there!

From my own experiments with different versions of pandas, it seems that SparseDataFrame.replace() works only on non-fill values. To change the fill value, you have the following options:

  • According to the pandas docs, changing the dtypes should automatically change the fill value. (That didn't work for me.)
  • Convert to a dense DataFrame, do the replacement, then convert back to a SparseDataFrame.
  • Manually construct a new SparseDataFrame, as in Wen's answer, or by passing default_fill_value set to the new fill value.
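The observed behavior and the dense round trip from the second option can be illustrated with a toy model of a sparse column. This is a pure-Python sketch under my own assumptions, not how pandas is implemented: the dict layout and the helpers `to_sparse`, `to_dense`, `replace_stored` and `change_fill` are all hypothetical.

```python
import math

NAN = float("nan")

def same(a, b):
    """Equality that also treats NaN as equal to NaN."""
    if isinstance(a, float) and isinstance(b, float) and math.isnan(a) and math.isnan(b):
        return True
    return a == b

def to_sparse(values, fill_value):
    """Store only the entries that differ from fill_value, keyed by position."""
    stored = {i: v for i, v in enumerate(values) if not same(v, fill_value)}
    return {"length": len(values), "fill_value": fill_value, "stored": stored}

def to_dense(column):
    """Expand back to a full list, filling the gaps with fill_value."""
    return [column["stored"].get(i, column["fill_value"]) for i in range(column["length"])]

def replace_stored(column, old, new):
    """Replace among stored entries only; the fill value is never touched."""
    stored = {i: (new if same(v, old) else v) for i, v in column["stored"].items()}
    return {**column, "stored": stored}

def change_fill(column, new_fill):
    """Dense round trip: expand, swap the old fill for the new one, re-sparsify."""
    old_fill = column["fill_value"]
    dense = [new_fill if same(v, old_fill) else v for v in to_dense(column)]
    return to_sparse(dense, new_fill)

col = to_sparse(['b', 'c', 0, 'a'], fill_value=0)   # last column of A
renamed = to_dense(replace_stored(col, 'a', 'z'))   # stored value is replaced
unchanged = to_dense(replace_stored(col, 0, NAN))   # 0 is the fill, so nothing changes
refilled = change_fill(col, NAN)                    # fill value becomes NaN
```

In this model a replace that only walks the stored entries can never reach the fill value, which mirrors what A.replace(0, np.nan) did above; only the round trip through the dense form changes it.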

While I was experimenting with the last option, something even stranger happened:

B = pd.SparseDataFrame(A,default_fill_value=np.nan)

B.density
0.25

B.default_fill_value
nan

So far, so good. But...

B
    0   1   2   3
0   a   0   0   b
1   0   0   0   c
2   0   0   0   0
3   0   0   0   a

That really shocked me at first: the fill value is nan, yet the zeros are still displayed. Is that even possible!?
Continuing on, I tried to see what is happening in the columns:

B[0]

0    a
1    0
2    0
3    0
Name: 0, dtype: object
BlockIndex
Block locations: array([0], dtype=int32)
Block lengths: array([1], dtype=int32)
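For reading that output: the block locations and block lengths describe runs of consecutive positions whose values are stored explicitly, so the int32 arrays hold positions and run lengths rather than the data itself. A minimal pure-Python sketch of that encoding, assuming a fill value of 0 as in A (`block_index` is a hypothetical helper, not the pandas class):

```python
def block_index(values, fill_value):
    """Return (locations, lengths) of runs of consecutive non-fill positions."""
    locations, lengths = [], []
    previous_stored = False
    for position, value in enumerate(values):
        stored = value != fill_value
        if stored and not previous_stored:
            # A new run of stored values starts here.
            locations.append(position)
            lengths.append(1)
        elif stored:
            # Still inside the current run: extend it.
            lengths[-1] += 1
        previous_stored = stored
    return locations, lengths

first_column = block_index(['a', 0, 0, 0], fill_value=0)     # one run at position 0
last_column = block_index(['b', 'c', 0, 'a'], fill_value=0)  # runs at positions 0 and 3
```

For the first column this gives locations [0] and lengths [1], the same numbers as in the output above.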

The dtype of the column is object, but the dtype of the BlockIndex associated with it is int32, hence the strange behavior.
There are a lot more "strange" things going on, but I'll stop here.
Given all of the above, I'd say you should avoid SparseDataFrame until it gets a complete rewrite :).

Qusai Alothman answered Sep 29 '22 09:09