Pandas - Duplicate Row based on condition

I'm trying to create a duplicate row if the row meets a condition. In the table below, I created a cumulative count based on a groupby, then another calculation for the MAX of the groupby.

df['PathID'] = df.groupby('Date Completed').cumcount() + 1
df['MaxPathID'] = df.groupby('Date Completed')['PathID'].transform('max')

Date Completed    PathID    MaxPathID
1/31/17           1         3
1/31/17           2         3
1/31/17           3         3
2/1/17            1         1
2/2/17            1         2
2/2/17            2         2

In this case, I want to duplicate only the record for 2/1/17 since there is only one instance for that date (i.e. where the MaxPathID == 1).

Desired Output:

Date Completed    PathID    MaxPathID
1/31/17           1         3
1/31/17           2         3
1/31/17           3         3
2/1/17            1         1
2/1/17            1         1
2/2/17            1         2
2/2/17            2         2
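
For reference, a minimal sketch to rebuild the example frame (assuming the dates are plain strings rather than parsed datetimes):

import pandas as pd

# rebuild the example input shown above
df = pd.DataFrame({'Date Completed': ['1/31/17', '1/31/17', '1/31/17',
                                      '2/1/17', '2/2/17', '2/2/17']})
df['PathID'] = df.groupby('Date Completed').cumcount() + 1
df['MaxPathID'] = df.groupby('Date Completed')['PathID'].transform('max')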

Thanks in advance!

Asked Mar 27 '17 by Walt Reed




3 Answers

I think you need to get the unique rows by Date Completed and then concatenate them back to the original:

df1 = df.loc[~df['Date Completed'].duplicated(keep=False), ['Date Completed']]
print (df1)
  Date Completed
3         2/1/17

df = pd.concat([df,df1], ignore_index=True).sort_values('Date Completed')
df['PathID'] = df.groupby('Date Completed').cumcount() + 1
df['MaxPathID'] = df.groupby('Date Completed')['PathID'].transform('max')
print (df)
  Date Completed  PathID  MaxPathID
0        1/31/17       1          3
1        1/31/17       2          3
2        1/31/17       3          3
3         2/1/17       1          2
6         2/1/17       2          2
4         2/2/17       1          2
5         2/2/17       2          2

EDIT:

print (df)
  Date Completed  a  b
0        1/31/17  4  5
1        1/31/17  3  5
2        1/31/17  6  3
3         2/1/17  7  9
4         2/2/17  2  0
5         2/2/17  6  7

df1 = df[~df['Date Completed'].duplicated(keep=False)]
#alternative - boolean indexing by numpy array
#df1 = df[~df['Date Completed'].duplicated(keep=False).values]
print (df1)
  Date Completed  a  b
3         2/1/17  7  9

df = pd.concat([df,df1], ignore_index=True).sort_values('Date Completed')
print (df)
  Date Completed  a  b
0        1/31/17  4  5
1        1/31/17  3  5
2        1/31/17  6  3
3         2/1/17  7  9
6         2/1/17  7  9
4         2/2/17  2  0
5         2/2/17  6  7
Answered Oct 05 '22 by jezrael


A creative numpy approach using duplicated + repeat

import numpy as np

dc = df['Date Completed']
# repeat counts: 2 for dates that appear only once, 1 for dates already duplicated
rg = np.arange(len(dc)).repeat((~dc.duplicated(keep=False).values) + 1)
df.iloc[rg]

  Date Completed  PathID  MaxPathID
0        1/31/17       1          3
1        1/31/17       2          3
2        1/31/17       3          3
3         2/1/17       1          1
3         2/1/17       1          1
4         2/2/17       1          2
5         2/2/17       2          2
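
To see why this works: duplicated(keep=False) marks every repeated date, so the repeat counts come out as 2 for one-off dates and 1 for everything else, and iloc then repeats exactly the unique rows. Checking the intermediates on the example frame:

print((~dc.duplicated(keep=False).values) + 1)
[1 1 1 2 1 1]

print(rg)
[0 1 2 3 3 4 5]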
Answered Oct 05 '22 by piRSquared


I know this might be a slightly different problem, but it matches the question description, so people will land here from Google. I haven't looked into optimizing the code below; I'm sure there is a better way, but sometimes you just have to embrace imperfection ;) so I'm posting it here in case somebody faces something similar and wants a quick, working solution. It seemed to run fairly fast.

Suppose we have a dataframe (df) like so:

[image: the example dataframe, with multiple space-delimited values in the field3 column]

And we want to transform it into something like this, given the condition that field3 has more than one entry, expanding all of the entries within it like so:

[image: the expanded dataframe, one row per field3 entry]

Here is one approach for that:

import pandas as pd
from datetime import datetime

index = []
double_values = []

### get the index and the list of values on which to expand each row
for i, r in df.iterrows():
    index.append(i)
    ### transform the column with multiple entries into a list, split on the delimiter
    double_values.append(str(r.iloc[2]).split(' '))

serieses = []

print('tot rows to process', len(index))
count = 0
for i, dvs in zip(index, double_values):
    count += 1
    if count % 1000 == 0:
        print('elems left', len(index) - count, datetime.now().strftime("%d/%m/%Y %H:%M:%S"))
    if len(dvs) > 1:
        for dv in dvs:
            series = df.loc[i].copy()   # copy so the original row is not mutated
            series.loc['field3'] = dv
            serieses.append(list(series))

### create a dataframe out of the expanded rows collected in serieses (a list of lists)
df2 = pd.DataFrame.from_records(serieses, columns=df.columns)

### drop the original rows with multiple entries, which have been expanded already
indexes_to_drop = []
for i, dvs in zip(index, double_values):
    if len(dvs) > 1:
        indexes_to_drop.append(i)

df.drop(indexes_to_drop, inplace=True)

### DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df = pd.concat([df, df2], ignore_index=True)
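
On pandas 0.25 or newer, the same expansion can be written far more compactly with str.split plus explode — a minimal sketch, assuming field3 is the space-delimited column from the screenshots:

import pandas as pd

# split field3 on spaces, then emit one row per list element;
# single-entry rows pass through unchanged
df = (df.assign(field3=df['field3'].astype(str).str.split(' '))
        .explode('field3')
        .reset_index(drop=True))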
Answered Oct 05 '22 by Yev Guyduy