
Pandas - Identify Last Row by Date

I'm trying to accomplish two things in my Pandas dataframe:

  1. Create a new column LastRow ('Yes' or 'No') that flags the last row for each date in DateCompleted.
  2. Capture the next transaction's Sales on the current row, unless the next row falls on a new date in DateCompleted (in which case mark it as null).

Original Dataset

        DateCompleted      TranNumber  Sales

    0   1/1/17 10:15AM     3133         130.31
    1   1/1/17 11:21AM     3531         103.12  
    2   1/1/17 12:31PM     3652         99.23  
    3   1/2/17 9:31AM      3689         83.22
    4   1/2/17 10:31AM     3701         29.93
    5   1/3/17 8:30AM      3709         31.31 

Desired Output

        DateCompleted      TranNumber   Sales    NextTranSales  LastRow

    0   1/1/17 10:15AM     3133         130.31   103.12         No
    1   1/1/17 11:21AM     3531         103.12   99.23          No
    2   1/1/17 12:31PM     3652         99.23    NaN            Yes
    3   1/2/17 9:31AM      3689         83.22    29.93          No 
    4   1/2/17 10:31AM     3701         29.93    NaN            Yes
    5   1/3/17 8:30AM      3709         31.31    ...            No

I can get the NextTranSales based on:

 df['NextTranSales'] = df.Sales.shift(-1)

But I'm having trouble determining the last row in each DateCompleted date group and setting NextTranSales to null on that row.

Thanks for your help!

Walt Reed asked Mar 24 '17 21:03



2 Answers

If your DataFrame is already sorted by the DateCompleted column, groupby.shift may be all you need:

import pandas as pd
# Group by calendar date so the shift never crosses a date boundary
date = pd.to_datetime(df.DateCompleted).dt.date
df["NextTranSales"] = df.groupby(date).Sales.shift(-1)

If you need the LastRow column, you can find the index of the last row in each date group with groupby and then assign "Yes" to those rows:

# Index labels of the last row within each date group
last_row_index = df.groupby(date).tail(1).index
df["LastRow"] = "No"
df.loc[last_row_index, "LastRow"] = "Yes"
df
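
For anyone who wants to reproduce this end to end, here is a minimal self-contained sketch. It rebuilds the six sample rows from the question and applies both steps. The explicit `format` string is an assumption about how the timestamps are written, and the last rows are located with `groupby(...).tail(1).index`.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "DateCompleted": ["1/1/17 10:15AM", "1/1/17 11:21AM", "1/1/17 12:31PM",
                      "1/2/17 9:31AM", "1/2/17 10:31AM", "1/3/17 8:30AM"],
    "TranNumber": [3133, 3531, 3652, 3689, 3701, 3709],
    "Sales": [130.31, 103.12, 99.23, 83.22, 29.93, 31.31],
})

# Calendar date of each transaction, used as the grouping key.
# The format string is an assumption about the timestamp layout.
date = pd.to_datetime(df["DateCompleted"], format="%m/%d/%y %I:%M%p").dt.date

# Next sale within the same date; the last row of each date gets NaN.
df["NextTranSales"] = df.groupby(date)["Sales"].shift(-1)

# Index labels of the last row of each date group.
last_row_index = df.groupby(date).tail(1).index
df["LastRow"] = "No"
df.loc[last_row_index, "LastRow"] = "Yes"

print(df)
```

With only these six rows, row 5 is flagged "Yes" because it is the last (and only) row for 1/3/17 in the sample.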

Psidom answered Sep 28 '22 14:09


NOTE: This depends on Sales being free of NaN. If Sales contains any NaN, some rows will be wrongly flagged as the last row, because I'm relying on the fact that the group-wise shift leaves a NaN in the last position of each group.

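# Assumes DateCompleted is already a datetime64 column
d = df.DateCompleted.dt.date
m = {True: 'Yes', False: 'No'}
# Shift within each date; the last row of every date gets NaN
s = df.groupby(d).Sales.shift(-1)
df = df.assign(NextTranSales=s).assign(LastRow=s.isnull().map(m))
print(df)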

        DateCompleted  TranNumber   Sales  NextTranSales LastRow
0 2017-01-01 10:15:00        3133  130.31         103.12      No
1 2017-01-01 11:21:00        3531  103.12          99.23      No
2 2017-01-01 12:31:00        3652   99.23            NaN     Yes
3 2017-01-02 09:31:00        3689   83.22          29.93      No
4 2017-01-02 10:31:00        3701   29.93            NaN     Yes
5 2017-01-03 08:30:00        3709   31.31            NaN     Yes
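
To make the NaN caveat concrete, here is a small sketch on hypothetical data (not from the question): five rows over two dates, with one Sales value deliberately missing. It shows how testing the shifted column for nulls mislabels a row.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two dates, with a missing Sales value on row 1.
df = pd.DataFrame({
    "DateCompleted": pd.to_datetime([
        "2017-01-01 10:15", "2017-01-01 11:21", "2017-01-01 12:31",
        "2017-01-02 09:31", "2017-01-02 10:31",
    ]),
    "Sales": [130.31, np.nan, 99.23, 83.22, 29.93],
})

d = df.DateCompleted.dt.date
s = df.groupby(d).Sales.shift(-1)

# Row 0 is followed by the NaN sale, so its shifted value is NaN and it is
# flagged as a last row even though two more rows share its date.
last_row = s.isnull().map({True: "Yes", False: "No"})
print(last_row.tolist())  # ['Yes', 'No', 'Yes', 'No', 'Yes']
```

Row 0 comes out as "Yes" although rows 1 and 2 fall on the same date.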

We can drop the no-NaN restriction with this:

d = df.DateCompleted.dt.date
s = df.groupby(d).Sales.shift(-1)
# Mark the last row of each date 'Yes' by position, then fill every other row with 'No'
l = pd.Series(
    'Yes', df.groupby(d).tail(1).index
).reindex(df.index, fill_value='No')
df.assign(NextTranSales=s).assign(LastRow=l)

        DateCompleted  TranNumber   Sales  NextTranSales LastRow
0 2017-01-01 10:15:00        3133  130.31         103.12      No
1 2017-01-01 11:21:00        3531  103.12          99.23      No
2 2017-01-01 12:31:00        3652   99.23            NaN     Yes
3 2017-01-02 09:31:00        3689   83.22          29.93      No
4 2017-01-02 10:31:00        3701   29.93            NaN     Yes
5 2017-01-03 08:30:00        3709   31.31            NaN     Yes
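
As a quick check that the `tail(1)` version does not depend on the Sales values, here is a sketch on hypothetical data (not from the question) with a missing Sales value in the middle of a date group:

```python
import numpy as np
import pandas as pd

# Hypothetical data: two dates, with a missing Sales value on row 1.
df = pd.DataFrame({
    "DateCompleted": pd.to_datetime([
        "2017-01-01 10:15", "2017-01-01 11:21", "2017-01-01 12:31",
        "2017-01-02 09:31", "2017-01-02 10:31",
    ]),
    "Sales": [130.31, np.nan, 99.23, 83.22, 29.93],
})

d = df.DateCompleted.dt.date

# tail(1) picks the last row of each date by position, so NaN in Sales
# has no effect on which rows are flagged.
last_idx = df.groupby(d).tail(1).index
l = pd.Series("Yes", index=last_idx).reindex(df.index, fill_value="No")
print(l.tolist())  # ['No', 'No', 'Yes', 'No', 'Yes']
```

Only rows 2 and 4, the true last rows of each date, are marked "Yes".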
piRSquared answered Sep 28 '22 14:09