I'm trying to create a duplicate row if the row meets a condition. In the table below, I created a cumulative count based on a groupby, then another calculation for the MAX of the groupby.
df['PathID'] = df.groupby('Date Completed').cumcount() + 1
df['MaxPathID'] = df.groupby('Date Completed')['PathID'].transform('max')
Date Completed    PathID    MaxPathID
1/31/17           1         3
1/31/17           2         3
1/31/17           3         3
2/1/17            1         1
2/2/17            1         2
2/2/17            2         2
In this case, I want to duplicate only the record for 2/1/17 since there is only one instance for that date (i.e. where the MaxPathID == 1).
Desired Output:
Date Completed    PathID    MaxPathID
1/31/17           1         3
1/31/17           2         3
1/31/17           3         3
2/1/17            1         1
2/1/17            1         1
2/2/17            1         2
2/2/17            2         2
Thanks in advance!
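For reference, the example frame above can be reconstructed like this (a minimal sketch; the column is assumed to be named 'Date Completed' with a space, matching the printed output):

import pandas as pd

# sample data matching the table above (dates kept as strings)
df = pd.DataFrame({'Date Completed': ['1/31/17', '1/31/17', '1/31/17',
                                      '2/1/17', '2/2/17', '2/2/17']})
df['PathID'] = df.groupby('Date Completed').cumcount() + 1
df['MaxPathID'] = df.groupby('Date Completed')['PathID'].transform('max')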
I think you need to get the unique rows by Date Completed and then concat them to the original:
df1 = df.loc[~df['Date Completed'].duplicated(keep=False), ['Date Completed']]
print (df1)
  Date Completed
3         2/1/17
df = pd.concat([df,df1], ignore_index=True).sort_values('Date Completed')
df['PathID'] = df.groupby('Date Completed').cumcount() + 1
df['MaxPathID'] = df.groupby('Date Completed')['PathID'].transform('max')
print (df)
  Date Completed  PathID  MaxPathID
0        1/31/17       1          3
1        1/31/17       2          3
2        1/31/17       3          3
3         2/1/17       1          2
6         2/1/17       2          2
4         2/2/17       1          2
5         2/2/17       2          2
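The keep=False flag is what makes this work: duplicated(keep=False) marks every occurrence of a repeated value, so negating it isolates the dates that appear exactly once. A quick illustration on the sample dates:

s = pd.Series(['1/31/17', '1/31/17', '1/31/17', '2/1/17', '2/2/17', '2/2/17'])
print(~s.duplicated(keep=False))
# only position 3 (2/1/17) is True, because it occurs exactly once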
EDIT:
print (df)
  Date Completed  a  b
0        1/31/17  4  5
1        1/31/17  3  5
2        1/31/17  6  3
3         2/1/17  7  9
4         2/2/17  2  0
5         2/2/17  6  7
df1 = df[~df['Date Completed'].duplicated(keep=False)]
#alternative - boolean indexing by numpy array
#df1 = df[~df['Date Completed'].duplicated(keep=False).values]
print (df1)
  Date Completed  a  b
3         2/1/17  7  9
df = pd.concat([df,df1], ignore_index=True).sort_values('Date Completed')
print (df)
  Date Completed  a  b
0        1/31/17  4  5
1        1/31/17  3  5
2        1/31/17  6  3
3         2/1/17  7  9
6         2/1/17  7  9
4         2/2/17  2  0
5         2/2/17  6  7
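Note that the concatenated frame keeps the non-monotonic index shown above (3, 6, 4, 5). If a clean 0..n-1 index is wanted, reset it after sorting; a small sketch, assuming pandas 1.0+ for the ignore_index keyword of sort_values:

df = pd.concat([df, df1]).sort_values('Date Completed', ignore_index=True)
# on older pandas: .sort_values('Date Completed').reset_index(drop=True)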
A creative numpy approach using duplicated + repeat:
import numpy as np

dc = df['Date Completed']
# repeat each positional index twice when its date is unique, once otherwise
rg = np.arange(len(dc)).repeat((~dc.duplicated(keep=False).values) + 1)
df.iloc[rg]
  Date Completed  PathID  MaxPathID
0        1/31/17       1          3
1        1/31/17       2          3
2        1/31/17       3          3
3         2/1/17       1          1
3         2/1/17       1          1
4         2/2/17       1          2
5         2/2/17       2          2
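An equivalent spelling that avoids building the positional array by hand uses Index.repeat; a sketch, assuming the index labels are unique so loc can repeat them:

# repeat unique dates twice, duplicated dates once
repeats = np.where(df['Date Completed'].duplicated(keep=False), 1, 2)
df.loc[df.index.repeat(repeats)]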
I know this might be a slightly different problem, but it matches the question description, so people will come here from Google. I haven't looked into optimizing the code below or anything like that; I am sure there is a better way, but sometimes you just have to embrace imperfection ;) so I'm posting it here in case somebody faces something similar and wants a quick, working solution. It seemed to run fairly fast.
Suppose we have a dataframe (df) in which one column, field3, holds several space-delimited entries in a single cell, and we want to expand each such row into one row per entry (the before-and-after example tables were posted as images, so only this description survives). Here is one approach for that:
import pandas as pd
from datetime import datetime

index = []
double_values = []
### collect each row's index and the list of values to expand into
for i, r in df.iterrows():
    index.append(i)
    ### split the multi-entry column into a list on the space delimiter
    double_values.append(str(r.iloc[2]).split(' '))
serieses = []
print('total rows to process', len(index))
count = 0
for i, dvs in zip(index, double_values):
    count += 1
    if count % 1000 == 0:
        print('elements left', len(index) - count, datetime.now().strftime("%d/%m/%Y %H:%M:%S"))
    if len(dvs) > 1:
        for dv in dvs:
            series = df.loc[i].copy()  # copy so the original row stays untouched
            series.loc['field3'] = dv
            serieses.append(list(series))
# create a dataframe out of the expanded rows collected in serieses (a list of lists)
df2 = pd.DataFrame.from_records(serieses, columns=df.columns)
### drop the original multi-entry rows, which have been expanded and appended already
indexes_to_drop = [i for i, dvs in zip(index, double_values) if len(dvs) > 1]
df = df.drop(indexes_to_drop)
# DataFrame.append was removed in pandas 2.0, so concatenate instead
df = pd.concat([df, df2], ignore_index=True)
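For what it's worth, on newer pandas the same expansion can be done in two vectorized lines with str.split plus DataFrame.explode (explode needs pandas 0.25+, ignore_index needs 1.1+). A sketch, assuming the delimited column is named field3; rows with a single entry pass through unchanged:

df['field3'] = df['field3'].astype(str).str.split(' ')
df = df.explode('field3', ignore_index=True)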