I have a MultiIndex pandas DataFrame in which I want to apply a function to one of its columns and assign the result to that same column.
In [1]:
import numpy as np
import pandas as pd
cols = ['One', 'Two', 'Three', 'Four', 'Five']
df = pd.DataFrame(np.array(list('ABCDEFGHIJKLMNO'), dtype='object').reshape(3,5), index = list('ABC'), columns=cols)
df.to_hdf('/tmp/test.h5', 'df')
df = pd.read_hdf('/tmp/test.h5', 'df')
df
Out[1]:
One Two Three Four Five
A A B C D E
B F G H I J
C K L M N O
3 rows × 5 columns
In [2]:
df.columns = pd.MultiIndex.from_arrays([list('UUULL'), ['One', 'Two', 'Three', 'Four', 'Five']])
df['L']['Five'] = df['L']['Five'].apply(lambda x: x.lower())
df
-c:2: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
Out[2]:
U L
One Two Three Four Five
A A B C D E
B F G H I J
C K L M N O
3 rows × 5 columns
In [3]:
df.columns = ['One', 'Two', 'Three', 'Four', 'Five']
df
Out[3]:
One Two Three Four Five
A A B C D E
B F G H I J
C K L M N O
3 rows × 5 columns
In [4]:
df['Five'] = df['Five'].apply(lambda x: x.upper())
df
Out[4]:
One Two Three Four Five
A A B C D E
B F G H I J
C K L M N O
3 rows × 5 columns
As you can see, the function is not applied to the column. I guess this is related to the warning I get:
-c:2: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
What is strange is that this warning only appears sometimes, and I haven't been able to work out when it happens and when it doesn't.
I managed to apply the function by slicing the DataFrame with .loc, as the warning recommended:
In [5]:
df.columns = pd.MultiIndex.from_arrays([list('UUULL'), ['One', 'Two', 'Three', 'Four', 'Five']])
df.loc[:,('L','Five')] = df.loc[:,('L','Five')].apply(lambda x: x.lower())
df
Out[5]:
U L
One Two Three Four Five
A A B C D e
B F G H I j
C K L M N o
3 rows × 5 columns
but I would like to understand why this behavior happens when doing dict-like slicing (e.g. df['L']['Five']) and not when using .loc slicing.
NOTE: The DataFrame comes from an HDF file which was not multi-indexed. Is this perhaps the cause of the strange behavior?
EDIT: I'm using pandas v0.13.1 and NumPy v1.8.0.
df['L']['Five'] selects level 0 with the value 'L' and returns a DataFrame, from which the column 'Five' is then selected, returning the accessed Series.
The __getitem__ accessor for a DataFrame (the []) will try to do the right thing and give you the correct column. However, this is chained indexing, which the pandas indexing documentation discusses under returning a view versus a copy.
To access a multi-index, use the tuple notation, ('a','b'), together with .loc, which is unambiguous, e.g. df.loc[:,('a','b')]. Furthermore, this allows indexing on multiple axes at the same time (e.g. rows AND columns).
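As a minimal sketch of the tuple notation, the snippet below rebuilds the frame from the question (without the HDF round trip, which is assumed not to matter here) and addresses the ('L','Five') column through .loc. The names five_values and cell are only illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild the frame from the question with two-level columns.
df = pd.DataFrame(
    np.array(list('ABCDEFGHIJKLMNO'), dtype='object').reshape(3, 5),
    index=list('ABC'),
    columns=pd.MultiIndex.from_arrays(
        [list('UUULL'), ['One', 'Two', 'Three', 'Four', 'Five']]
    ),
)

# Select one column: the tuple has one entry per column level.
five_values = df.loc[:, ('L', 'Five')].tolist()

# Index rows and columns in the same call.
cell = df.loc['B', ('L', 'Five')]

# Assign through the same single indexer.
df.loc[:, ('L', 'Five')] = df.loc[:, ('L', 'Five')].apply(lambda x: x.lower())
print(df)
```

Because the row selector and the column tuple travel together, pandas knows exactly which part of the original frame is being written to.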
So, why does this not work when you do chained indexing and assignment, e.g. df['L']['Five'] = value?
df['L'] returns a DataFrame that is singly-indexed. Then another Python operation, df_with_L['Five'], selects the Series indexed by 'Five'. I indicated the intermediate result with another variable, df_with_L. Because pandas sees these operations as separate events (i.e. separate calls to __getitem__), it has to treat them as linear operations: they happen one after another.
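To make the two steps visible, here is a rough sketch that writes the chained expression as two statements, using the df_with_L name from the explanation above. Whether the original df also sees the change depends on the pandas version and on whether the first step returned a view or a copy, so the sketch only inspects the intermediate frame. Older pandas versions may emit SettingWithCopyWarning on the assignment.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.array(list('ABCDEFGHIJKLMNO'), dtype='object').reshape(3, 5),
    index=list('ABC'),
    columns=pd.MultiIndex.from_arrays(
        [list('UUULL'), ['One', 'Two', 'Three', 'Four', 'Five']]
    ),
)

# Step 1: the first [] returns a new, singly-indexed DataFrame.
df_with_L = df['L']

# Step 2: the assignment targets that intermediate object, not df itself.
df_with_L['Five'] = df_with_L['Five'].apply(lambda x: x.lower())

print(df_with_L)  # the intermediate frame holds the lowercased column
print(df)         # may or may not reflect the change; do not rely on it
```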
Contrast this with df.loc[:,('L','Five')], which passes a nested tuple of (:,('L','Five')), i.e. (slice(None),('L','Five')), to a single call to __getitem__ (or __setitem__ when assigning). This allows pandas to deal with this as a single entity (and, FYI, be quite a bit faster because it can directly index into the frame).
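The difference in call structure can be illustrated without pandas at all. The Recorder class below is a hypothetical stand-in (not a pandas object) that only logs which indexing method Python invokes and with what key.

```python
class Recorder:
    """Hypothetical indexable object that logs every indexing call."""

    def __init__(self, log, name='obj'):
        self.log = log
        self.name = name

    def __getitem__(self, key):
        self.log.append((self.name, '__getitem__', key))
        # Return a new object, as a real container would return a sub-selection.
        return Recorder(self.log, name=f'{self.name}[{key!r}]')

    def __setitem__(self, key, value):
        self.log.append((self.name, '__setitem__', key))


# Chained form: a get on the original, then a set on the intermediate result.
log_chained = []
Recorder(log_chained)['L']['Five'] = 1

# Single-indexer form: one set on the original, with a nested tuple as key.
log_single = []
Recorder(log_single)[:, ('L', 'Five')] = 1

print(log_chained)
print(log_single)
```

In the chained form the object that receives the assignment is the intermediate one, which has no link back to the original. In the single-indexer form the original receives the whole key in one call.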
Why does this matter? Since chained indexing is 2 calls, it is possible that either call returns a copy of the data because of the way it is sliced. Thus when setting, you may actually be setting on a copy and not on the original frame. It is impossible for pandas to figure this out because there are 2 separate Python operations that are not connected.
The SettingWithCopy warning is a 'heuristic' to detect this (meaning it tends to catch most cases but is simply a lightweight check). Figuring this out for real is way complicated.
The .loc operation is a single Python operation, and thus can select a slice (which still may be a copy), but it allows pandas to assign that slice back into the frame after it is modified, thus setting the values as you would expect.
The reason for the warning is this. Sometimes when you slice an array you will simply get a view back, which means you can set it with no problem. However, even a single-dtyped array can generate a copy if sliced in a particular way. A multi-dtyped DataFrame (meaning it has, say, float and object data) will almost always yield a copy. Whether a view is created depends on the memory layout of the array.
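The view-versus-copy distinction is easiest to see at the NumPy level. This small sketch uses a single-dtype array and np.shares_memory to check whether a selection still points at the original data. The array and variable names are only illustrative.

```python
import numpy as np

a = np.arange(12, dtype='float64').reshape(3, 4)

# Basic slicing returns a view onto the same memory.
basic = a[:, 1]

# Indexing with a list of positions (fancy indexing) returns a copy.
fancy = a[:, [1, 3]]

print(np.shares_memory(a, basic))  # True
print(np.shares_memory(a, fancy))  # False

# Writing through the view changes the original array.
basic[0] = -1.0

# Writing to the copy leaves the original untouched.
fancy[0, 1] = 99.0

print(a[0])
```

Assigning into the result of a chained selection is safe only in the first situation, and the caller usually cannot tell which of the two applies.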
Note: this doesn't have anything to do with the source of the data.