Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

NaN values in pivot_table index causes loss of data

Here is a simple DataFrame:

> df = pd.DataFrame({'a': ['a1', 'a2', 'a3'],
                     'b': ['optional1', None, 'optional3'],
                     'c': ['c1', 'c2', 'c3'],
                     'd': [1, 2, 3]})
> df

    a          b   c  d
0  a1  optional1  c1  1
1  a2       None  c2  2
2  a3  optional3  c3  3

Pivot method 1

The data can be pivoted to this:

> df.pivot_table(index=['a','b'], columns='c')
                d     
c              c1   c3
a  b                  
a1 optional1  1.0  NaN
a3 optional3  NaN  3.0

Downside: data in the 2nd row is lost because df['b'][1] == None.

Pivot method 2

> df.pivot_table(index=['a'], columns='c')
      d          
c    c1   c2   c3
a                
a1  1.0  NaN  NaN
a2  NaN  2.0  NaN
a3  NaN  NaN  3.0

Downside: column b is lost.

How can the two methods be combined so that columns b and the 2nd row are kept like so:

                d     
c              c1   c2   c3
a  b                  
a1 optional1  1.0  NaN  NaN
a2      None  NaN  2.0  NaN
a3 optional3  NaN  NaN  3.0

More generally: How can information from a row be retained during pivoting if a key has NaN value?

like image 721
victorlin Avatar asked Jan 24 '17 21:01

victorlin


People also ask

What are NaN values in a data frame?

NaN stands for Not A Number and is one of the common ways to represent the missing value in the data. It is a special floating-point value and cannot be converted to any other type than float. NaN value is one of the major problems in Data Analysis.

What is a NaN value in pandas?

Within pandas, a missing value is denoted by NaN . In most cases, the terms missing and null are interchangeable, but to abide by the standards of pandas, we'll continue using missing throughout this tutorial.

Which function is useful for filling values in place of NaN?

The fillna() function is used to fill NA/NaN values using the specified method. Value to use to fill holes (e.g. 0), alternately a dict/Series/DataFrame of values specifying which value to use for each index (for a Series) or column (for a DataFrame).


1 Answers

Use set_index and unstack to perform the pivot:

df = df.set_index(['a', 'b', 'c']).unstack('c')

This is essentially what pandas does under the hood for pivot. The stack and unstack methods are closely related to pivot, and can generally be used to perform pivot-like operations that don't quite conform with the built-in pivot functions.

The resulting output:

                d          
c              c1   c2   c3
a  b                       
a1 optional1  1.0  NaN  NaN
a2 NaN        NaN  2.0  NaN
a3 optional3  NaN  NaN  3.0
like image 129
root Avatar answered Oct 11 '22 13:10

root