I have a dataframe with 3 columns: Replaced_ID, New_ID and Installation Date of New_ID.
Each New_ID replaces the Replaced_ID.
Replaced_ID New_ID Installation Date (of New_ID)
3 5 16/02/2018
5 7 17/05/2019
7 9 21/06/2019
9 11 23/08/2020
25 39 16/02/2017
39 41 16/08/2018
My goal is to get a dataframe which includes the first and last record of the sequence. I care only for the first Replaced_ID value and the last New_ID value.
I.e., from the dataframe above I want this:
Replaced_ID New_ID Installation Date (of New_ID)
3 11 23/08/2020
25 41 16/08/2018
Sorting by date and performing a shift is not the solution here, as far as I can tell.
Also, I tried to join the New_ID column with Replaced_ID, but that does not work because it returns only the previous element of the sequence.
I need to find a way to get the sequences [3,5,7,9,11] and [25,39,41] by combining the Replaced_ID and New_ID columns across all rows.
I care mostly about getting the first Replaced_ID value and the last New_ID value, not the Installation Date, because I can perform a join at the end.
Any ideas here? Thanks.
First, let's create the DataFrame:
import pandas as pd
import numpy as np
from io import StringIO
data = """Replaced_ID,New_ID,Installation Date (of New_ID)
3,5,16/02/2018
5,7,17/05/2019
7,9,21/06/2019
9,11,23/08/2020
25,39,16/02/2017
39,41,16/08/2018
11,14,23/09/2020
41,42,23/10/2020
"""
### note that I've added two rows to check whether it works with non-consecutive rows
### defining some short hands
r = "Replaced_ID"
n = "New_ID"
i = "Installation Date (of New_ID)"
df = pd.read_csv(StringIO(data), header=0, sep=",")
### the dates are day-first (dd/mm/yyyy), so state the format explicitly
df[i] = pd.to_datetime(df[i], format="%d/%m/%Y")
And now for my actual solution:
a = df[[r,n]].values.flatten()
### returns a flat array of r and n values, in which every intermediate ID appears twice, i.e.:
# [ 3 5 5 7 7 9 9 11 25 39 39 41 11 14 41 42]
### now keep only the values that occur once (the two ends of each chain),
# and reshape them into pairs, such that the first column gives the lowest (replaced) id,
# and the second column gives the highest (new) id, i.e.:
# [[ 3 14]
# [25 42]]
### note: np.unique returns sorted values, so this pairing assumes that IDs only grow
# within a chain and that the ID ranges of different chains do not interleave
u, c = np.unique(a, return_counts=True)
res = u[c == 1].reshape(-1, 2)
### now filter the dataframe where "New_ID" is equal to the second column of res, i.e. [14,42],
# and replace the entries in "r" with the "lowest possible values" of r
# (.copy() avoids assigning into a view of df)
dfn = df[df[n].isin(res[:, 1])].copy()
### sort by n so that the rows line up with the rows of res
dfn = dfn.sort_values(n)
dfn[r] = res[:, 0]
print(dfn)
Which yields:
Replaced_ID New_ID Installation Date (of New_ID)
6 3 14 2020-09-23
7 25 42 2020-10-23
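The np.unique trick above depends on how the IDs happen to sort. If that cannot be relied on, one alternative is to follow each chain explicitly. The following is only a minimal sketch, not part of the original answer: the function name collapse_chains is made up for illustration, and it assumes that every ID is replaced at most once and that New_ID values are unique.

```python
import pandas as pd


def collapse_chains(df, old="Replaced_ID", new="New_ID"):
    """Return one row per replacement chain: first old ID, last new ID.

    Hypothetical helper; assumes each ID is replaced at most once
    and that the values in the `new` column are unique.
    """
    # Map every replaced ID to the ID that replaced it.
    successor = dict(zip(df[old], df[new]))
    # A chain starts at a replaced ID that never appears as a new ID.
    starts = df.loc[~df[old].isin(df[new]), old]

    rows = []
    for start in starts:
        last = start
        seen = {start}
        # Walk the chain until an ID has no successor.
        while last in successor:
            last = successor[last]
            if last in seen:
                raise ValueError(f"cycle detected at ID {last}")
            seen.add(last)
        rows.append({old: start, new: last})

    ends = pd.DataFrame(rows, columns=[old, new])
    # Pull in the remaining columns (e.g. the installation date)
    # from the row that introduced the last new ID.
    return ends.merge(df.drop(columns=old), on=new, how="left")


# Same data as above, with the dates left as plain strings.
df = pd.DataFrame({
    "Replaced_ID": [3, 5, 7, 9, 25, 39, 11, 41],
    "New_ID": [5, 7, 9, 11, 39, 41, 14, 42],
    "Installation Date (of New_ID)": [
        "16/02/2018", "17/05/2019", "21/06/2019", "23/08/2020",
        "16/02/2017", "16/08/2018", "23/09/2020", "23/10/2020",
    ],
})
result = collapse_chains(df)
print(result)
```

For the example data this gives the same two chains as above (3 to 14 and 25 to 42), but it does not depend on the row order or on the numeric order of the IDs.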