I have a pandas DataFrame with 2 columns. Some of the MessageIDs end on the same row they start on, with the NewMessageID, like index row 0 below. Others span several rows: the one starting in index row 2 doesn't end until index row 4. I am looking for a clever way to simplify the output into a new DataFrame.
df
   MessageID  NewMessageID
0         28            10
1         21             9
2          4            18
3          3             6
4         18            22
5         99           102
6        102           118
7          1            20
I am looking for an output like:
df1
   Start  Finish
0     28      10
1     21       9
2      4      22
3      3       6
4     99     118
5      1      20
A Series holds a single column of values with an index, whereas a DataFrame is made up of one or more Series; in other words, a DataFrame is a collection of Series that can be used to analyse data.
You can create a new DataFrame with additional columns by using the DataFrame.assign() method. assign() assigns new columns to a DataFrame and returns a new object (a copy) with the new columns added to the original ones.
I have yet another solution, since I noticed the most up-voted solution will not work in a scenario where there are more than two rows to be linked. I added yet another connection, from 22 -> 23 to show that it works in such a scenario.
import pandas as pd

def main():
    b = pd.DataFrame()
    b['MessageID'] = [28, 21, 4, 3, 18, 99, 22, 102, 1]
    b['NewMessageID'] = [10, 9, 18, 6, 22, 102, 23, 118, 20]
    b = b.rename(columns={'MessageID': 'Start', 'NewMessageID': 'End'})
    rows_to_drop = []
    for i, row in b.iterrows():
        recursion(i, row, b, rows_to_drop)
    b.drop(index=rows_to_drop, inplace=True)
    print(b)

def recursion(i, row, b, rows_to_drop):
    # Find the row (if any) whose Start continues this row's End.
    exists = b[b['Start'] == row['End']]
    if not exists.empty and i not in rows_to_drop and exists.index[0] not in rows_to_drop:
        # Take over the linked row's End (as a scalar) and mark that row for removal.
        b.at[i, 'End'] = exists['End'].iloc[0]
        rows_to_drop.append(exists.index[0])
        # Rescan the frame so that longer chains keep collapsing.
        for _i, _row in b.iterrows():
            recursion(_i, _row, b, rows_to_drop)

main()
Output:
   Start  End
0     28   10
1     21    9
2      4   23
3      3    6
5     99  118
8      1   20
It clearly is suboptimal - we are iterating over a dataframe here. But it should do the trick, and be efficient enough for relatively small datasets.
It has yet another upside - we are maintaining the input order.
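If iterating over the DataFrame becomes too slow, the same idea can be sketched with a plain dictionary look-up instead of repeated row scans. This is only an illustration, not part of the answer above: the function name collapse_chains is made up for this sketch, and it assumes that every MessageID appears at most once and that the links never form a closed loop (a closed loop would simply drop out of the result). It keeps the input order of the chain heads.

```python
import pandas as pd


def collapse_chains(df, start_col="MessageID", end_col="NewMessageID"):
    """Collapse linked rows so that each chain becomes one Start/Finish row.

    Hypothetical helper: assumes unique values in start_col and no closed loops.
    """
    # Map every start value to the value it hands over to.
    next_id = dict(zip(df[start_col], df[end_col]))
    # A row is the head of a chain when no other row ends on its start value.
    ends = set(df[end_col])
    rows = []
    for start in df[start_col]:
        if start in ends:
            continue  # middle or tail of a chain, absorbed by its head
        finish = next_id[start]
        seen = {start}
        # Follow the links until the chain stops; `seen` guards against
        # a chain that runs into a loop.
        while finish in next_id and finish not in seen:
            seen.add(finish)
            finish = next_id[finish]
        rows.append((start, finish))
    return pd.DataFrame(rows, columns=["Start", "Finish"])


df = pd.DataFrame({
    "MessageID": [28, 21, 4, 3, 18, 99, 102, 1],
    "NewMessageID": [10, 9, 18, 6, 22, 102, 118, 20],
})
df1 = collapse_chains(df)
print(df1)
```

Each MessageID is visited a bounded number of times, so the work grows roughly linearly with the number of rows, and the result comes back with a fresh 0..n-1 index.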
Join the DataFrame on itself to create df2, then drop the rows from the original df that have values in common between the two columns. Keep the outer two columns of df2, rename them to match df, and append one to the other.
import pandas as pd

df = pd.DataFrame({'MessageID':[28,21,4,3,18,99,102,1],'NewMessageID':[10,9,18,6,22,102,118,20]})
# Self-join: pair each row with the row that continues it
df2 = df.merge(df, left_on='NewMessageID', right_on='MessageID')
df2 = df2[['MessageID_x','NewMessageID_y']]
df2.columns = ['MessageID', 'NewMessageID']
# Keep only the rows that are not part of any link
df = df[(~df['MessageID'].isin(df['NewMessageID'])) & (~df['NewMessageID'].isin(df['MessageID']))]
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
output = pd.concat([df, df2])
   MessageID  NewMessageID
0         28            10
1         21             9
3          3             6
7          1            20
0          4            22
1         99           118
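As the other answer points out, a single self-join only resolves links that span two rows. One possible way around this, sketched here with Series.map standing in for the merge, is to repeat the look-up until no chain can be extended any further. The variable names heads and links are made up for this sketch, and it assumes that MessageID values are unique and that no chain runs into a loop (otherwise the while loop would never finish).

```python
import pandas as pd

df = pd.DataFrame({
    "MessageID": [28, 21, 4, 3, 18, 99, 22, 102, 1],
    "NewMessageID": [10, 9, 18, 6, 22, 102, 23, 118, 20],
})

# Keep only the rows that start a chain: their MessageID is nobody's NewMessageID.
heads = df[~df["MessageID"].isin(df["NewMessageID"])].copy()

# Look-up table from each MessageID to the ID it hands over to.
links = df.set_index("MessageID")["NewMessageID"]

# Each pass advances every unfinished chain by one link.
while heads["NewMessageID"].isin(links.index).any():
    step = heads["NewMessageID"].map(links)  # NaN where the chain has already ended
    heads["NewMessageID"] = step.fillna(heads["NewMessageID"]).astype("int64")

output = heads.reset_index(drop=True)
print(output)
```

The number of passes equals the length of the longest chain, so this stays vectorised as long as chains are short compared with the number of rows.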