I'm trying to duplicate a row when it meets a condition. In the table below, I created a cumulative count within a groupby, then computed the max of that count per group.
df['PathID'] = df.groupby('Date Completed').cumcount() + 1
df['MaxPathID'] = df.groupby('Date Completed')['PathID'].transform('max')
Date Completed  PathID  MaxPathID
1/31/17         1       3
1/31/17         2       3
1/31/17         3       3
2/1/17          1       1
2/2/17          1       2
2/2/17          2       2
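For reference, a minimal runnable sketch of the setup above (assuming the column is literally named 'Date Completed'):

```python
import pandas as pd

df = pd.DataFrame({
    'Date Completed': ['1/31/17', '1/31/17', '1/31/17', '2/1/17', '2/2/17', '2/2/17']
})

# running count within each date group (1-based)
df['PathID'] = df.groupby('Date Completed').cumcount() + 1
# broadcast each group's maximum back onto every row of the group
df['MaxPathID'] = df.groupby('Date Completed')['PathID'].transform('max')
print(df)
```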
In this case, I want to duplicate only the record for 2/1/17 since there is only one instance for that date (i.e. where the MaxPathID == 1).
Desired Output:
Date Completed  PathID  MaxPathID
1/31/17         1       3
1/31/17         2       3
1/31/17         3       3
2/1/17          1       1
2/1/17          1       1
2/2/17          1       2
2/2/17          2       2
Thanks in advance!
I think you need to get the rows whose Date Completed is unique and then concat them back to the original:
df1 = df.loc[~df['Date Completed'].duplicated(keep=False), ['Date Completed']]
print(df1)
  Date Completed
3         2/1/17
df = pd.concat([df, df1], ignore_index=True).sort_values('Date Completed')
df['PathID'] = df.groupby('Date Completed').cumcount() + 1
df['MaxPathID'] = df.groupby('Date Completed')['PathID'].transform('max')
print(df)
  Date Completed  PathID  MaxPathID
0        1/31/17       1          3
1        1/31/17       2          3
2        1/31/17       3          3
3         2/1/17       1          2
6         2/1/17       2          2
4         2/2/17       1          2
5         2/2/17       2          2
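A self-contained version of the steps above, with the sample data rebuilt from the question:

```python
import pandas as pd

df = pd.DataFrame({
    'Date Completed': ['1/31/17', '1/31/17', '1/31/17', '2/1/17', '2/2/17', '2/2/17']
})

# rows whose date occurs exactly once
df1 = df.loc[~df['Date Completed'].duplicated(keep=False), ['Date Completed']]

# append them once more, then rebuild the counters
df = pd.concat([df, df1], ignore_index=True).sort_values('Date Completed')
df['PathID'] = df.groupby('Date Completed').cumcount() + 1
df['MaxPathID'] = df.groupby('Date Completed')['PathID'].transform('max')
print(df)
```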
EDIT:
print(df)
  Date Completed  a  b
0        1/31/17  4  5
1        1/31/17  3  5
2        1/31/17  6  3
3         2/1/17  7  9
4         2/2/17  2  0
5         2/2/17  6  7
df1 = df[~df['Date Completed'].duplicated(keep=False)]
#alternative - boolean indexing by numpy array
#df1 = df[~df['Date Completed'].duplicated(keep=False).values]
print(df1)
  Date Completed  a  b
3         2/1/17  7  9
df = pd.concat([df, df1], ignore_index=True).sort_values('Date Completed')
print(df)
  Date Completed  a  b
0        1/31/17  4  5
1        1/31/17  3  5
2        1/31/17  6  3
3         2/1/17  7  9
6         2/1/17  7  9
4         2/2/17  2  0
5         2/2/17  6  7
A creative numpy approach using duplicated + repeat:
import numpy as np

dc = df['Date Completed']
# repeat positions of unique dates twice, all others once
rg = np.arange(len(dc)).repeat((~dc.duplicated(keep=False).values) + 1)
df.iloc[rg]
  Date Completed  PathID  MaxPathID
0        1/31/17       1          3
1        1/31/17       2          3
2        1/31/17       3          3
3         2/1/17       1          1
3         2/1/17       1          1
4         2/2/17       1          2
5         2/2/17       2          2
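The same repeat trick can also stay inside pandas with Index.repeat instead of a hand-built numpy position array (a sketch on reconstructed data, not part of the original answer):

```python
import pandas as pd

df = pd.DataFrame({
    'Date Completed': ['1/31/17', '1/31/17', '1/31/17', '2/1/17', '2/2/17', '2/2/17'],
    'PathID': [1, 2, 3, 1, 1, 2],
    'MaxPathID': [3, 3, 3, 1, 2, 2],
})

# rows whose date is unique get repeated twice, all others once
reps = (~df['Date Completed'].duplicated(keep=False)) + 1
out = df.loc[df.index.repeat(reps)]
print(out)
```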
I know this is a slightly different problem, but it matches the question description, so people will land here from Google. I haven't looked into optimizing the code below; I'm sure there is a better way, but sometimes you just have to embrace imperfection ;) so I'm posting it here in case somebody faces something similar and wants a quick, working solution. It seemed to run fairly fast.
Suppose we have a dataframe (df) whose third column, field3, holds space-delimited entries, and we want to expand every row where field3 has more than one entry into one row per entry.
Here is one approach for that:
import pandas as pd
from datetime import datetime

index = []
double_values = []
### get index and a list of values on which to expand each indexed row
for i, r in df.iterrows():
    index.append(i)
    ### transform the column with multiple entries into a list based on the delimiter
    double_values.append(str(r.iloc[2]).split(' '))

serieses = []
print('tot rows to process', len(index))
count = 0
for i, dvs in zip(index, double_values):
    count += 1
    if count % 1000 == 0:
        print('elems left', len(index) - count, datetime.now().strftime("%d/%m/%Y %H:%M:%S"))
    if len(dvs) > 1:
        for dv in dvs:
            series = df.iloc[i].copy()  # copy so the original row is not mutated
            series.loc['field3'] = dv
            serieses.append(list(series))

# create a dataframe out of the expanded rows collected in serieses, a list of lists
df2 = pd.DataFrame.from_records(serieses, columns=df.columns)

### drop the original rows with multiple entries, which have been expanded and appended already
indexes_to_drop = []
for i, dvs in zip(index, double_values):
    if len(dvs) > 1:
        indexes_to_drop.append(i)
df.drop(df.index[indexes_to_drop], inplace=True)

df = pd.concat([df, df2])  # DataFrame.append was removed in pandas 2.0
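As an aside, newer pandas versions can do this whole expansion in one vectorized step with Series.str.split plus DataFrame.explode (available since pandas 0.25); a minimal sketch, assuming the delimited column is named field3 as in the loop above:

```python
import pandas as pd

# toy frame standing in for the df described above; field3 holds space-delimited entries
df = pd.DataFrame({
    'field1': [1, 2],
    'field3': ['a b c', 'd'],
})

# split each cell into a list, then emit one row per list element
df['field3'] = df['field3'].str.split(' ')
df = df.explode('field3').reset_index(drop=True)
print(df)
```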