
Pandas: How to create a new DataFrame with a start and end, even if they are on different rows

Tags:

python

pandas

I have a pandas DataFrame with 2 columns. Some MessageIDs end on the same row they start on, as a NewMessageID, like index row 0 below. Others, like the one in index row 2, don't end until a later row (index row 4). I am looking for a clever way to simplify the output into a new DataFrame.

df
    MessageID   NewMessageID
0   28          10
1   21          9
2   4           18
3   3           6
4   18          22
5   99          102
6   102         118
7   1           20

I am looking for an output like:

df1
    Start  Finish
0   28     10 
1   21     9
2   4      22
3   3      6
4   99     118
5   1      20 
sectechguy asked Sep 06 '19 18:09



2 Answers

Here is yet another solution, since I noticed that the most up-voted solution will not work when more than two rows need to be linked. I added one more connection, 22 -> 23, to show that this one works in such a scenario.

import pandas as pd


def main():
    b = pd.DataFrame()
    b['MessageID'] = [28, 21, 4, 3, 18, 99, 22, 102, 1]
    b['NewMessageID'] = [10, 9, 18, 6, 22, 102, 23, 118, 20]
    b = b.rename(columns={'MessageID': 'Start', 'NewMessageID': 'End'})
    rows_to_drop = []
    for i, row in b.iterrows():
        recursion(i, row, b, rows_to_drop)
    # Drop the rows that were merged into an earlier row.
    b.drop(index=rows_to_drop, inplace=True)
    print(b)


def recursion(i, row, b, rows_to_drop):
    # Find the row (if any) whose Start continues this row's End.
    exists = b[b['Start'] == row['End']]
    if not exists.empty and i not in rows_to_drop and exists.index[0] not in rows_to_drop:
        # Extend this row to the linked row's End (a scalar, not a Series).
        b.at[i, 'End'] = exists['End'].iloc[0]
        rows_to_drop.append(exists.index[0])
        # Re-scan so that longer chains are followed to their end.
        for _i, _row in b.iterrows():
            recursion(_i, _row, b, rows_to_drop)


main()

Output:

   Start  End
0     28   10
1     21    9
2      4   23
3      3    6
5     99  118
8      1   20

This is clearly suboptimal, since it iterates over the DataFrame row by row and re-scans it on every link it finds. But it should do the trick, and be efficient enough for relatively small datasets.

It has another upside: the input order is maintained.
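As a possible alternative to the repeated iterrows() scans, the same chain-following idea can be sketched with a plain dictionary lookup. This is only a sketch, not part of the answer above: the function name collapse_chains is hypothetical, and it assumes every MessageID value appears at most once in the MessageID column. It uses the original data from the question.

```python
import pandas as pd


def collapse_chains(df, start_col="MessageID", end_col="NewMessageID"):
    """Collapse linked rows into one Start/Finish row per chain.

    Hypothetical helper; assumes values in start_col are unique.
    """
    # Map each start value to the value it hands over to.
    next_id = dict(zip(df[start_col], df[end_col]))
    ends = set(df[end_col])
    rows = []
    for start in df[start_col]:
        if start in ends:
            # This row continues another row's chain, so it is not a chain head.
            continue
        finish = next_id[start]
        seen = {start}
        # Follow the links until the chain stops (or loops back on itself).
        while finish in next_id and finish not in seen:
            seen.add(finish)
            finish = next_id[finish]
        rows.append((start, finish))
    return pd.DataFrame(rows, columns=["Start", "Finish"])


df = pd.DataFrame({
    "MessageID": [28, 21, 4, 3, 18, 99, 102, 1],
    "NewMessageID": [10, 9, 18, 6, 22, 102, 118, 20],
})
result = collapse_chains(df)
print(result)
```

Because chain heads are visited in their original row order, the input order is kept here as well, and the index of the result is renumbered from 0.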

Epion answered Oct 07 '22 10:10


Join the DataFrame on itself to create df2, then drop the rows from the original df that have values in common between the two columns (these are the linked rows). Keep the outer two columns of df2, rename them to match df, and append one to the other.

import pandas as pd

df = pd.DataFrame({'MessageID':[28,21,4,3,18,99,102,1],'NewMessageID':[10,9,18,6,22,102,118,20]})

# Self-join: link each row to the row that continues it.
df2 = df.merge(df, left_on='NewMessageID', right_on='MessageID')
df2 = df2[['MessageID_x','NewMessageID_y']]
df2.columns = ['MessageID', 'NewMessageID']

# Keep only the rows that are not part of any link.
df = df[(~df['MessageID'].isin(df['NewMessageID'])) & (~df['NewMessageID'].isin(df['MessageID']))]

# DataFrame.append was removed in pandas 2.0; pd.concat does the same job.
output = pd.concat([df, df2])

Output:

   MessageID  NewMessageID
0         28            10
1         21             9
3          3             6
7          1            20
0          4            22
1         99           118
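A single self-join links each row to only one follow-up row. One way to handle longer chains while staying vectorized might be to repeat the lookup with Series.map until no further link is found. The sketch below is an assumption-laden illustration rather than this answer's method: the function name resolve_chains is hypothetical, MessageID values are assumed to be unique, and the data includes the extra 22 -> 23 link so that one chain spans three rows.

```python
import pandas as pd


def resolve_chains(df):
    """Follow MessageID -> NewMessageID links to the end of each chain.

    Hypothetical helper; assumes MessageID values are unique.
    """
    # Lookup table: MessageID -> NewMessageID.
    links = df.set_index("MessageID")["NewMessageID"]
    # Chain heads are rows whose MessageID is nobody's NewMessageID.
    heads = df[~df["MessageID"].isin(df["NewMessageID"])]
    finish = heads["NewMessageID"]
    # Each pass advances every unfinished chain by one link.
    # The loop is capped at len(df) passes so a cycle cannot run forever.
    for _ in range(len(df)):
        nxt = finish.map(links)  # NaN where the chain has already ended
        if nxt.isna().all():
            break
        finish = nxt.fillna(finish).astype(finish.dtype)
    return heads.assign(NewMessageID=finish).rename(
        columns={"MessageID": "Start", "NewMessageID": "Finish"}
    )


df = pd.DataFrame({
    "MessageID": [28, 21, 4, 3, 18, 99, 22, 102, 1],
    "NewMessageID": [10, 9, 18, 6, 22, 102, 23, 118, 20],
})
result = resolve_chains(df)
print(result)
```

The number of passes depends on the longest chain rather than on the number of rows, and the result keeps the original index labels of the chain heads.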
Chris answered Oct 07 '22 09:10