
Applying a function to a MultiIndex pandas.DataFrame column

I have a MultiIndex pandas DataFrame in which I want to apply a function to one of its columns and assign the result to that same column.

In [1]:
    import numpy as np
    import pandas as pd
    cols = ['One', 'Two', 'Three', 'Four', 'Five']
    df = pd.DataFrame(np.array(list('ABCDEFGHIJKLMNO'), dtype='object').reshape(3,5), index = list('ABC'), columns=cols)
    df.to_hdf('/tmp/test.h5', 'df')
    df = pd.read_hdf('/tmp/test.h5', 'df')
    df
Out[1]:
         One     Two     Three  Four    Five
    A    A       B       C      D       E
    B    F       G       H      I       J
    C    K       L       M      N       O
    3 rows × 5 columns

In [2]:
    df.columns = pd.MultiIndex.from_arrays([list('UUULL'), ['One', 'Two', 'Three', 'Four', 'Five']])
    df['L']['Five'] = df['L']['Five'].apply(lambda x: x.lower())
    df
-c:2: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead 
Out[2]:
         U                      L
         One    Two     Three   Four    Five
    A    A      B       C       D       E
    B    F      G       H       I       J
    C    K      L       M       N       O
    3 rows × 5 columns

In [3]:
    df.columns = ['One', 'Two', 'Three', 'Four', 'Five']
    df    
Out[3]:
         One    Two     Three   Four    Five
    A    A      B       C       D       E
    B    F      G       H       I       J
    C    K      L       M       N       O
    3 rows × 5 columns

In [4]:
    df['Five'] = df['Five'].apply(lambda x: x.upper())
    df
Out[4]:
         One    Two     Three   Four    Five
    A    A      B       C       D       E
    B    F      G       H       I       J
    C    K      L       M       N       O
    3 rows × 5 columns

As you can see, the function is not applied to the column. I assume this is related to the following warning:

-c:2: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead

What is strange is that this warning only appears sometimes, and I haven't been able to work out when it happens and when it doesn't.

I managed to apply the function by slicing the DataFrame with .loc, as the warning recommended:

In [5]:
    df.columns = pd.MultiIndex.from_arrays([list('UUULL'), ['One', 'Two', 'Three', 'Four', 'Five']])
    df.loc[:,('L','Five')] = df.loc[:,('L','Five')].apply(lambda x: x.lower())
    df

Out[5]:
         U                      L
         One    Two     Three   Four    Five
    A    A      B       C       D       e
    B    F      G       H       I       j
    C    K      L       M       N       o
    3 rows × 5 columns

However, I would like to understand why this happens with dict-like slicing (e.g. df['L']['Five']) but not with .loc slicing.

NOTE: The DataFrame comes from an HDF file which was not multiindexed. Is this perhaps the cause of the strange behavior?

EDIT: I'm using Pandas v.0.13.1 and NumPy v.1.8.0

VGonPa asked Apr 08 '14 09:04




1 Answer

df['L']['Five'] first selects level 0 with the value 'L', which returns a DataFrame; the column 'Five' is then selected from that DataFrame, which returns the Series.

The __getitem__ accessor for a DataFrame (the []) will try to do the right thing and give you the correct column. However, this is chained indexing; see the pandas indexing documentation on returning a view versus a copy.

To access a multi-index, use tuple notation, ('a','b'), together with .loc, which is unambiguous, e.g. df.loc[:,('a','b')]. This also allows indexing on multiple axes at the same time (e.g. rows AND columns).
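As a small sketch of the tuple notation, the snippet below rebuilds the question's frame (without the HDF round trip, which is an assumption made to keep it self-contained) and reads from it with a single .loc call. The variable names five, cell and block are only illustrative.

```python
import numpy as np
import pandas as pd

# Same 3x5 frame of letters as in the question, with two-level columns.
df = pd.DataFrame(
    np.array(list('ABCDEFGHIJKLMNO'), dtype='object').reshape(3, 5),
    index=list('ABC'),
    columns=pd.MultiIndex.from_arrays(
        [list('UUULL'), ['One', 'Two', 'Three', 'Four', 'Five']]
    ),
)

# One column, addressed by its full (level 0, level 1) key.
five = df.loc[:, ('L', 'Five')]

# Rows and columns selected in the same call.
cell = df.loc['B', ('L', 'Five')]          # single value, 'J'
block = df.loc[['A', 'C'], ('U', 'Two')]   # rows A and C of column U/Two

print(five.tolist(), cell, block.tolist())
```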

So why does this not work when you do chained indexing and assignment, e.g. df['L']['Five'] = value?

df['L'] returns a DataFrame that is singly indexed. Then another Python operation, df_with_L['Five'], selects the Series indexed by 'Five' (I use a separate variable name to indicate the intermediate result). pandas sees these as separate events (separate calls to __getitem__), so it has to treat them as linear operations that happen one after another.
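The two steps can be written out explicitly. This is only a sketch of the read path on a rebuilt copy of the question's frame; it does not assign anything, because whether a chained assignment reaches the original frame is exactly the part that cannot be relied on.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.array(list('ABCDEFGHIJKLMNO'), dtype='object').reshape(3, 5),
    index=list('ABC'),
    columns=pd.MultiIndex.from_arrays(
        [list('UUULL'), ['One', 'Two', 'Three', 'Four', 'Five']]
    ),
)

# Step 1: first __getitem__ call. The result is a new DataFrame object
# whose columns have lost the top level.
df_with_L = df['L']

# Step 2: second, independent __getitem__ call on that intermediate object.
five = df_with_L['Five']

print(type(df_with_L).__name__, list(df_with_L.columns))
print(df_with_L is df)   # the intermediate is a different object from df
print(five.tolist())
```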

Contrast this with df.loc[:,('L','Five')], which passes a nested tuple, (slice(None),('L','Five')), to a single call to __getitem__. This allows pandas to deal with it as a single entity (and, FYI, to be quite a bit faster because it can index directly into the frame).
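The difference in the number of calls is plain Python behavior and can be observed without pandas. The Probe class below is a hypothetical stand-in, not a pandas object: it only records the keys that the [] operator hands to __getitem__.

```python
class Probe:
    """Hypothetical stand-in for a frame: records every key passed to []."""

    def __init__(self, calls=None):
        # All probes created in one chain share the same call log.
        self.calls = [] if calls is None else calls

    def __getitem__(self, key):
        self.calls.append(key)
        return Probe(self.calls)


chained = Probe()
chained['L']['Five']          # two separate __getitem__ calls

single = Probe()
single[:, ('L', 'Five')]      # one call that receives a nested tuple

print(chained.calls)  # ['L', 'Five']
print(single.calls)   # [(slice(None, None, None), ('L', 'Five'))]
```

In the chained form, the second call has no knowledge of the first; in the tuple form, the whole request arrives at once.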

Why does this matter? Since chained indexing is 2 calls, either call may return a copy of the data because of the way it is sliced. Thus when you set a value, you may actually be setting it on a copy and not on the original frame. pandas cannot figure this out because there are 2 separate Python operations that are not connected.

The SettingWithCopy warning is a 'heuristic' to detect this (meaning it tends to catch most cases but is simply a lightweight check). Figuring this out for real is way more complicated.

The .loc operation is a single Python operation. It can select a slice (which still may be a copy), but it allows pandas to assign that slice back into the frame after it is modified, thus setting the values as you would expect.
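A minimal sketch of the single-operation assignment, again on a rebuilt copy of the question's frame (the HDF round trip is left out as an assumption to keep the snippet self-contained):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.array(list('ABCDEFGHIJKLMNO'), dtype='object').reshape(3, 5),
    index=list('ABC'),
    columns=pd.MultiIndex.from_arrays(
        [list('UUULL'), ['One', 'Two', 'Three', 'Four', 'Five']]
    ),
)

# Read the column, transform it, and write it back with one .loc
# assignment on the original frame.
lowered = df.loc[:, ('L', 'Five')].apply(lambda x: x.lower())
df.loc[:, ('L', 'Five')] = lowered

print(df[('L', 'Five')].tolist())   # ['e', 'j', 'o']
print(df[('L', 'Four')].tolist())   # neighbouring column is untouched
```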

The reason for the warning is this: sometimes when you slice an array you simply get a view back, which means you can set it with no problem. However, even a single-dtyped array can generate a copy if sliced in a particular way. A multi-dtyped DataFrame (meaning it has, say, float and object data) will almost always yield a copy. Whether a view is created depends on the memory layout of the array.
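The view-versus-copy distinction can be seen at the NumPy level. This sketch uses a plain integer array rather than a DataFrame: a basic slice gives a view, while indexing with a list of positions gives a copy.

```python
import numpy as np

a = np.arange(6)

view = a[1:4]          # basic slice: shares memory with a
copy = a[[1, 2, 3]]    # list of positions: new, independent array

view[0] = 100          # writes through to a
copy[1] = -1           # changes only the copy

print(np.shares_memory(a, view))  # True
print(np.shares_memory(a, copy))  # False
print(a.tolist())                 # [0, 100, 2, 3, 4, 5]
```

Both selections pick the same elements, yet only one of them lets a later assignment reach the original array.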

Note: this doesn't have anything to do with the source of the data.

Jeff answered Sep 28 '22 01:09