
Computing diffs within groups of a dataframe

Tags:

python

pandas

Say I have a dataframe with 3 columns: Date, Ticker, Value (no index, at least to start with). I have many dates and many tickers, but each (ticker, date) tuple is unique. (But obviously the same date will show up in many rows since it will be there for multiple tickers, and the same ticker will show up in multiple rows since it will be there for many dates.)

Initially, my rows are in a specific order, but not sorted by any of the columns.

I would like to compute first differences (daily changes) of each ticker (ordered by date) and put these in a new column in my dataframe. Given this context, I cannot simply do

df['diffs'] = df['value'].diff()

because adjacent rows do not come from the same ticker. Sorting like this:

df = df.sort_values(['ticker', 'date'])
df['diffs'] = df['value'].diff()

doesn't solve the problem because there will be "borders". I.e. after that sort, the last value for one ticker will be above the first value for the next ticker. And computing differences then would take a difference between two tickers. I don't want this. I want the earliest date for each ticker to wind up with an NaN in its diff column.

This seems like an obvious time to use groupby but for whatever reason, I can't seem to get it to work properly. To be clear, I would like to perform the following process:

1. Group rows based on their ticker
2. Within each group, sort rows by their date
3. Within each sorted group, compute differences of the value column
4. Put these differences into the original dataframe in a new diffs column (ideally leaving the original dataframe order intact)

I have to imagine this is a one-liner. But what am I missing?

Edit at 9:00pm 2013-12-17

Ok...some progress. I can do the following to get a new dataframe:

result = df.set_index(['ticker', 'date'])\
    .groupby(level='ticker')\
    .transform(lambda x: x.sort_index().diff())\
    .reset_index()

But if I understand the mechanics of groupby, my rows will now be sorted first by ticker and then by date. Is that correct? If so, would I need to do a merge to append the differences column (currently in result['current']) to the original dataframe df?

673

asked Dec 18 '13 01:12

8one6

2 Answers

Wouldn't it be easier to do what you describe yourself, namely

df.sort_values(['ticker', 'date'], inplace=True)
df['diffs'] = df['value'].diff()

and then correct for borders:

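# True on the first row of each ticker, where the diff crosses a border
mask = df['ticker'] != df['ticker'].shift(1)
# use .loc rather than chained indexing so the assignment hits df itself
df.loc[mask, 'diffs'] = np.nan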

To maintain the original index, you may do idx = df.index at the beginning and df = df.reindex(idx) at the end (reindex returns a new dataframe, so assign the result). Or, if it is a huge dataframe, perform the operations on

df.filter(['ticker', 'date', 'value'])

and then join the two dataframes at the end.
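The sort, diff, and mask steps above can be sketched end to end as follows. This is a minimal sketch, assuming a recent pandas in which `sort_values` replaces the older `df.sort`. The sample frame is the one used later in this answer, and the names `original_index` and `is_border` are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "date": [63, 88, 22, 76, 72, 34, 92, 22, 32, 59],
    "ticker": list("CCCABABAAB"),
    "value": [1.65, -1.93, -1.29, -0.79, -1.24, -0.23, 2.43, 0.55, -2.50, -1.01],
})

original_index = df.index  # remember the incoming row order

df = df.sort_values(["ticker", "date"])
df["diffs"] = df["value"].diff()

# The first row of each ticker is a "border": its diff mixes two tickers,
# so blank it out.
is_border = df["ticker"] != df["ticker"].shift(1)
df.loc[is_border, "diffs"] = np.nan

df = df.reindex(original_index)  # restore the original row order
```

After the final `reindex`, the rows are back in their original positions and the earliest date of each ticker carries a NaN in `diffs`.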

Edit: alternatively (though still not using groupby):

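df.set_index(['ticker', 'date'], inplace=True)
df.sort_index(inplace=True)
df['diffs'] = np.nan

for idx in df.index.levels[0]:
    # select one ticker's rows, then diff within that ticker only
    rows = df.index.get_level_values('ticker') == idx
    df.loc[rows, 'diffs'] = df.loc[rows, 'value'].diff()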

For the input

   date ticker  value
0    63      C   1.65
1    88      C  -1.93
2    22      C  -1.29
3    76      A  -0.79
4    72      B  -1.24
5    34      A  -0.23
6    92      B   2.43
7    22      A   0.55
8    32      A  -2.50
9    59      B  -1.01

this will produce:

             value  diffs
ticker date
A      22     0.55    NaN
       32    -2.50  -3.05
       34    -0.23   2.27
       76    -0.79  -0.56
B      59    -1.01    NaN
       72    -1.24  -0.23
       92     2.43   3.67
C      22    -1.29    NaN
       63     1.65   2.94
       88    -1.93  -3.58

112

answered Oct 01 '22 09:10

behzad.nouri

Ok. Lots of thinking about this, and I think this is my favorite combination of the solutions above and a bit of playing around. Original data lives in df:

df.sort_values(['ticker', 'date'], inplace=True)

# for this example, with diff, I think this syntax is a bit clunky
# but for more general examples, this should be good.  But can we do better?
df['diffs'] = df.groupby(['ticker'])['value'].transform(lambda x: x.diff())

df.sort_index(inplace=True)

This will accomplish everything I want. And what I really like is that it can be generalized to cases where you want to apply a function more intricate than diff. In particular, you could do things like lambda x: x.rolling(20, min_periods=20).mean() (written as pd.rolling_mean(x, 20, 20) in older pandas) to make a column of rolling means where you don't need to worry about each ticker's data being corrupted by that of any other ticker (groupby takes care of that for you...).
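As a hedged sketch of that generalization, assuming a recent pandas: the grouped results keep the row labels of the frame they came from, so assigning them back aligns on the index and the original row order is left alone. The window of 2 (instead of 20) and the names `by_ticker` and `rolling_mean` are chosen only to suit the small sample frame from the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    "date": [63, 88, 22, 76, 72, 34, 92, 22, 32, 59],
    "ticker": list("CCCABABAAB"),
    "value": [1.65, -1.93, -1.29, -0.79, -1.24, -0.23, 2.43, 0.55, -2.50, -1.01],
})

# Sort by date first so each group is seen in date order,
# then group the value column by ticker.
by_ticker = df.sort_values("date").groupby("ticker")["value"]

# Assignment aligns on the index, so df keeps its original row order.
df["diffs"] = by_ticker.diff()
df["rolling_mean"] = by_ticker.transform(
    lambda x: x.rolling(2, min_periods=2).mean()
)
```

Here `by_ticker.diff()` is the built-in grouped equivalent of `transform(lambda x: x.diff())`; the `transform` form remains the one to reach for when the per-group function is something custom.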

So here's the question I'm left with...why doesn't the following work for the line that starts df['diffs']:

df['diffs'] = df.groupby(['ticker'])['value'].transform(np.diff)

When I do that, I get a diffs column full of 0's. Any thoughts on that?
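One difference that may be relevant, offered as an observation rather than a confirmed diagnosis: `np.diff` returns an array one element shorter than its input, whereas `Series.diff` keeps the length and puts a NaN first. `transform` expects each group's result to match the group's length. The sketch below only demonstrates the length mismatch on one ticker's values from the sample data; it does not reproduce the column of zeros.

```python
import numpy as np
import pandas as pd

values = pd.Series([0.55, -2.50, -0.23, -0.79])

numpy_diffs = np.diff(values.to_numpy())  # ndarray, one element shorter
pandas_diffs = values.diff()              # Series, same length, NaN first

print(len(values), len(numpy_diffs), len(pandas_diffs))
```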

answered Oct 01 '22 09:10

8one6
