I've got a dataframe, and I'm trying to append a column of sequential differences to it. I have found a method that I like a lot (and generalizes well for my use case). But I noticed one weird thing along the way. Can you help me make sense of it?
Here is some data that has the right structure (code modeled on an answer here):
import pandas as pd
import numpy as np
import random
from itertools import product
random.seed(1) # so you can play along at home
np.random.seed(2) # ditto
# make a list of dates for a few periods
dates = pd.date_range(start='2013-10-01', periods=4).to_native_types()
# make a list of tickers
tickers = ['ticker_%d' % i for i in range(3)]
# make a list of all the possible (date, ticker) tuples
pairs = list(product(dates, tickers))
# put them in a random order
random.shuffle(pairs)
# exclude a few possible pairs
pairs = pairs[:-3]
# make some data for all of our selected (date, ticker) tuples
values = np.random.rand(len(pairs))
mydates, mytickers = zip(*pairs)
data = pd.DataFrame({'date': mydates, 'ticker': mytickers, 'value':values})
Ok, great. This gives me a frame like so:
date ticker value
0 2013-10-03 ticker_2 0.435995
1 2013-10-04 ticker_2 0.025926
2 2013-10-02 ticker_1 0.549662
3 2013-10-01 ticker_0 0.435322
4 2013-10-02 ticker_2 0.420368
5 2013-10-03 ticker_0 0.330335
6 2013-10-04 ticker_1 0.204649
7 2013-10-02 ticker_0 0.619271
8 2013-10-01 ticker_2 0.299655
My goal is to add a new column to this dataframe that will contain sequential changes. The data needs to be in order to do this, but the ordering and the differencing needs to be done "ticker-wise" so that gaps in another ticker don't cause NA's for a given ticker. I want to do this without perturbing the dataframe in any other way (i.e. I do not want the resulting DataFrame to be reordered based on what was necessary to do the differencing). The following code works:
data1 = data.copy() #let's leave the original data alone for later experiments
data1.sort(['ticker', 'date'], inplace=True)
data1['diffs'] = data1.groupby(['ticker'])['value'].transform(lambda x: x.diff())
data1.sort_index(inplace=True)
data1
and returns:
date ticker value diffs
0 2013-10-03 ticker_2 0.435995 0.015627
1 2013-10-04 ticker_2 0.025926 -0.410069
2 2013-10-02 ticker_1 0.549662 NaN
3 2013-10-01 ticker_0 0.435322 NaN
4 2013-10-02 ticker_2 0.420368 0.120713
5 2013-10-03 ticker_0 0.330335 -0.288936
6 2013-10-04 ticker_1 0.204649 -0.345014
7 2013-10-02 ticker_0 0.619271 0.183949
8 2013-10-01 ticker_2 0.299655 NaN
So far, so good. If I replace the middle line above with the more concise code shown here, everything still works:
data2 = data.copy()
data2.sort(['ticker', 'date'], inplace=True)
data2['diffs'] = data2.groupby('ticker')['value'].diff()
data2.sort_index(inplace=True)
data2
A quick check shows that, in fact, data1
is equal to data2
. However, if I do this:
data3 = data.copy()
data3.sort(['ticker', 'date'], inplace=True)
data3['diffs'] = data3.groupby('ticker')['value'].transform(np.diff)
data3.sort_index(inplace=True)
data3
I get a strange result:
date ticker value diffs
0 2013-10-03 ticker_2 0.435995 0
1 2013-10-04 ticker_2 0.025926 NaN
2 2013-10-02 ticker_1 0.549662 NaN
3 2013-10-01 ticker_0 0.435322 NaN
4 2013-10-02 ticker_2 0.420368 NaN
5 2013-10-03 ticker_0 0.330335 0
6 2013-10-04 ticker_1 0.204649 NaN
7 2013-10-02 ticker_0 0.619271 NaN
8 2013-10-01 ticker_2 0.299655 0
What's going on here? When you call the .diff
method on a Pandas object, is it not just calling np.diff
? I know there's a diff
method on the DataFrame
class, but I couldn't figure out how to pass that to transform
without the lambda
function syntax I used to make data1
work. Am I missing something? Why is the diffs
column in data3
screwy? How can I have call the Pandas diff
method within transform
without needing to write a lambda
to do it?
Nice easy to reproduce example!! more questions should be like this!
Just pass a lambda to transform (this is tantamount to passing afuncton object, e.g. np.diff (or Series.diff) directly. So this equivalent to data1/data2
In [32]: data3['diffs'] = data3.groupby('ticker')['value'].transform(Series.diff)
In [34]: data3.sort_index(inplace=True)
In [25]: data3
Out[25]:
date ticker value diffs
0 2013-10-03 ticker_2 0.435995 0.015627
1 2013-10-04 ticker_2 0.025926 -0.410069
2 2013-10-02 ticker_1 0.549662 NaN
3 2013-10-01 ticker_0 0.435322 NaN
4 2013-10-02 ticker_2 0.420368 0.120713
5 2013-10-03 ticker_0 0.330335 -0.288936
6 2013-10-04 ticker_1 0.204649 -0.345014
7 2013-10-02 ticker_0 0.619271 0.183949
8 2013-10-01 ticker_2 0.299655 NaN
[9 rows x 4 columns]
I believe that np.diff
doesn't follow numpy's own unfunc guidelines to process array inputs (whereby it tries various methods to coerce input and send output, e.g. __array__
on input __array_wrap__
on output). I am not really sure why, see a bit more info here. So bottom line is that np.diff
is not dealing with the index properly and doing its own calculation (which in this case is wrong).
Pandas has a lot of methods where they don't just call the numpy function, mainly because they handle different dtypes, handle nans, and in this case, handle 'special' diffs. e.g. you can pass a time frequency to a datelike-index where it calculates how many n to actually diff.
You can see that the Series .diff()
method is different to np.diff()
:
In [11]: data.value.diff() # Note the NaN
Out[11]:
0 NaN
1 -0.410069
2 0.523736
3 -0.114340
4 -0.014955
5 -0.090033
6 -0.125686
7 0.414622
8 -0.319616
Name: value, dtype: float64
In [12]: np.diff(data.value.values) # the values array of the column
Out[12]:
array([-0.41006867, 0.52373625, -0.11434009, -0.01495459, -0.09003298,
-0.12568619, 0.41462233, -0.31961629])
In [13]: np.diff(data.value) # on the column (Series)
Out[13]:
0 NaN
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 NaN
Name: value, dtype: float64
In [14]: np.diff(data.value.index) # er... on the index
Out[14]: Int64Index([8], dtype=int64)
In [15]: np.diff(data.value.index.values)
Out[15]: array([1, 1, 1, 1, 1, 1, 1, 1])
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With