
Pandas DataFrame contains NaNs after write operation

Tags: python, pandas

Here is a minimum working example of my problem:

import pandas as pd

columns = pd.MultiIndex.from_product([['a', 'b', 'c'], range(2)])
a = pd.DataFrame(0.0, index=range(3),columns=columns, dtype='float')
b = pd.Series([13.0, 15.0])

a.loc[1,'b'] = b  # this line results in NaNs
a.loc[1,'b'] = b.values  # this yields correct behavior

Why is the first assignment incorrect? Both Series seem to have the same index, so I assume it should produce the correct result.

I am using pandas 0.17.0.

MindV0rtex asked Nov 11 '15 22:11


1 Answer

When you write

a.loc[1,'b'] = b

and b is a Series, the index of b has to exactly match the indexer generated by a.loc[1,'b'] in order for the values in b to be copied into a. It turns out, however, that when a.columns is a MultiIndex, the indexer for a.loc[1,'b'] is:

(Pdb) p new_ix
Index([(u'b', 0), (u'b', 1)], dtype='object')

whereas the index for b is

(Pdb) p ser.index
Int64Index([0, 1], dtype='int64')

They don't match, and therefore

(Pdb) p ser.index.equals(new_ix)
False

Since the two indexes are not equal, the code falls into a branch that reindexes b to new_ix. None of b's labels appear in new_ix, so the values that get assigned are

(Pdb) p ser.reindex(new_ix).values
array([ nan,  nan])
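
The check and the fallback above can be reproduced outside the debugger. This is a minimal sketch, assuming a reasonably recent pandas. `align_sketch` is a hypothetical helper that mirrors only the two expressions shown in the pdb session, not the actual pandas internals, and `tupleize_cols=False` is used so the tuples stay in a flat object Index like new_ix above:

```python
import pandas as pd


def align_sketch(ser, new_ix):
    """Hypothetical stand-in for the alignment step described above."""
    if ser.index.equals(new_ix):
        # Labels match exactly: the values are taken as they are.
        return ser.to_numpy().copy()
    # Labels differ: reindex, which fills every unmatched label with NaN.
    return ser.reindex(new_ix).to_numpy()


ser = pd.Series([13.0, 15.0])  # default integer index: 0, 1
new_ix = pd.Index([('b', 0), ('b', 1)], tupleize_cols=False)

# Integer labels 0 and 1 are not found among the tuple labels.
mismatched = align_sketch(ser, new_ix)

# The same values, labelled with new_ix, pass through unchanged.
matched = align_sketch(pd.Series([13.0, 15.0], index=new_ix), new_ix)

print(mismatched)
print(matched)
```

The point of the sketch is only that the labels, not the positions, decide which values survive the assignment.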

I found this by adding pdb.set_trace() to your code:

import pandas as pd

columns = pd.MultiIndex.from_product([['a', 'b', 'c'], range(2)])
a = pd.DataFrame(0.0, index=range(3),columns=columns, dtype='float')
b = pd.Series([13.0, 15.0])
import pdb
pdb.set_trace()
a.loc[1,'b'] = b  # this line results in NaNs
a.loc[1,'b'] = b.values  # this yields correct behavior

and stepping through it at a high level, which showed that the problem occurs in

        if isinstance(value, ABCSeries):
            value = self._align_series(indexer, value)

I then stepped through it again in finer detail, with a breakpoint at the line that calls self._align_series(indexer, value).


Notice that if you change the index of b to also be a MultiIndex:

b = pd.Series([13.0, 15.0], index=pd.MultiIndex.from_product([['b'], [0,1]]))

then

import pandas as pd

columns = pd.MultiIndex.from_product([['a', 'b', 'c'], range(2)])
a = pd.DataFrame(0.0, index=range(3),columns=columns, dtype='float')
b = pd.Series([13.0, 15.0], index=pd.MultiIndex.from_product([['b'], [0,1]]))
a.loc[1,'b'] = b
print(a)

yields

   a      b      c   
   0  1   0   1  0  1
0  0  0   0   0  0  0
1  0  0  13  15  0  0
2  0  0   0   0  0  0
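
If relabelling b is inconvenient, the positional route from the question (passing the raw values) avoids label alignment altogether. A minimal sketch, assuming a pandas version that provides `Series.to_numpy()`; the explicit length check is my addition, since without labels nothing else guards against a mismatch in order or size:

```python
import pandas as pd

columns = pd.MultiIndex.from_product([['a', 'b', 'c'], range(2)])
a = pd.DataFrame(0.0, index=range(3), columns=columns, dtype='float')
b = pd.Series([13.0, 15.0])

# Strip the labels: a plain array is assigned by position, not by index.
values = b.to_numpy()

# Added safeguard (not part of the original code): the array must have
# one value per column under 'b'.
if len(values) != a['b'].shape[1]:
    raise ValueError("b must have one value per column under 'b'")

a.loc[1, 'b'] = values
print(a)
```

The trade-off is that the order of the values in b must already match the column order under 'b', because no alignment takes place.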
unutbu answered Sep 29 '22 04:09