I have a pandas dataframe:
import numpy as np
import pandas as pd
a = pd.DataFrame(np.random.rand(5, 6) * 10, index=pd.date_range(start='2005', periods=5, freq='A'))
a.columns = pd.MultiIndex.from_product([('A','B'),('a','b','c')])
I want to subtract the row a.loc['2005'] from every row of a. To do that I've tried this:
In [22]:
a - a.loc['2005']
Out[22]:
             A           B
             a   b   c   a   b   c
2005-12-31   0   0   0   0   0   0
2006-12-31 NaN NaN NaN NaN NaN NaN
2007-12-31 NaN NaN NaN NaN NaN NaN
2008-12-31 NaN NaN NaN NaN NaN NaN
2009-12-31 NaN NaN NaN NaN NaN NaN
That doesn't work, because pandas aligns on the index during the operation, so every row other than 2005-12-31 comes out as NaN. This works:
In [24]:
pd.DataFrame(a.values - a.loc['2005'].values, index=a.index, columns=a.columns)
Out[24]:
                   A                             B
                   a         b         c         a         b         c
2005-12-31  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000
2006-12-31 -3.326761 -7.164628  8.188518 -0.863177  0.519587 -3.281982
2007-12-31  3.529531 -4.719756  8.444488  1.355366  7.468361 -4.023797
2008-12-31  3.139185 -8.420257  1.465101 -2.942519  1.219060 -5.146019
2009-12-31 -3.459710  0.519435 -1.049617 -2.779370  4.792227 -1.922461
But I don't want to have to form a new DataFrame every time I do this kind of operation. I've tried the apply() method like this: a.apply(lambda x: x - a.loc['2005'].values)
but I get ValueError: cannot copy sequence with size 6 to array axis with dimension 5
So I'm not really sure how to proceed. Is there a simple way to do this that I'm not seeing? I think there should be an easy way to do this in place, so that you don't have to construct a new DataFrame each time. I also tried the sub() method, but the subtraction is only applied to the first row, whereas I want to subtract the first row from each row in the DataFrame.
Pandas is great for aligning by index. So when you want Pandas to ignore the index, you need to drop the index. You can do that by converting the DataFrame a.loc['2005'] to a 1-dimensional NumPy array:
In [56]: a - a.loc['2005'].values.squeeze()
Out[56]:
                   A                             B
                   a         b         c         a         b         c
2005-12-31  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000
2006-12-31  0.325968  1.314776 -0.789328 -0.344669 -2.518857  7.361711
2007-12-31  0.084203  2.234445 -2.838454 -6.176795 -3.645513  8.955443
2008-12-31  3.798700  0.299529  1.303325 -2.770126 -1.284188  3.093806
2009-12-31  1.520930  2.660040  0.846996 -9.437851 -2.886603  6.705391
The squeeze method converts the NumPy array a.loc['2005'].values, of shape (1, 6), to an array of shape (6,). This allows the array to be broadcast across the rows (during the subtraction) as desired.
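For reference, here is a minimal, self-contained sketch of the approach above. The seeded generator, the explicit year-end dates, and the names row, flat, via_numpy and via_series are assumptions made for illustration; they are not part of the original question. The last line shows a different technique from the squeeze answer: selecting the row as a Series with a.iloc[0], whose index is the column labels, so pandas aligns it against the columns rather than the row index.

```python
import numpy as np
import pandas as pd

# Illustrative data: five year-end dates and a 2x3 MultiIndex on the columns.
rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility
index = pd.to_datetime([f"{year}-12-31" for year in range(2005, 2010)])
columns = pd.MultiIndex.from_product([("A", "B"), ("a", "b", "c")])
a = pd.DataFrame(rng.random((5, 6)) * 10, index=index, columns=columns)

# Partial-string selection returns a one-row DataFrame, so .values is 2-D.
row = a.loc["2005"].values   # shape (1, 6)
flat = row.squeeze()         # shape (6,)

# Subtracting a plain 1-D array bypasses index alignment; the array is
# broadcast across every row of the DataFrame.
via_numpy = a - flat

# Alternative (not from the answer above): a.iloc[0] is a Series indexed by
# the column labels, so it lines up with the columns and no NaNs appear.
via_series = a - a.iloc[0]
```

Both results keep the index and columns of a, so there is no need to rebuild the DataFrame by hand.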