How to identify the first occurrence of duplicate rows in a Python pandas DataFrame

Tags:

pandas

dataframe

python-2.7

I have a pandas DataFrame with duplicate values for a set of columns. For example:

import pandas as pd

df = pd.DataFrame({'Column1': {0: 1, 1: 2, 2: 3},
                   'Column2': {0: 'ABC', 1: 'XYZ', 2: 'ABC'},
                   'Column3': {0: 'DEF', 1: 'DEF', 2: 'DEF'},
                   'Column4': {0: 10, 1: 40, 2: 10}})

In [2]: df
Out[2]: 
   Column1 Column2 Column3  Column4
0        1     ABC     DEF       10
1        2     XYZ     DEF       40
2        3     ABC     DEF       10

Rows (1) and (3) are the same in Column2, Column3 and Column4. Essentially, Row (3) is a duplicate of Row (1).

I am looking for the following output:

Is_Duplicate, indicating whether the row is a duplicate or not [can be accomplished by using the "duplicated" method on the dataframe columns (Column2, Column3 and Column4)]

Dup_Index, the index of the original row, i.e. the first occurrence that the row duplicates.

In [3]: df
Out[3]: 
   Column1 Column2 Column3  Column4  Is_Duplicate  Dup_Index
0        1     ABC     DEF       10         False          0
1        2     XYZ     DEF       40         False          1
2        3     ABC     DEF       10          True          0

asked Feb 19 '13 08:02

user1652054

2 Answers

For the first column, there is a DataFrame method, duplicated:

In [11]: df.duplicated(['Column2', 'Column3', 'Column4'])
Out[11]: 
0    False
1    False
2     True

In [12]: df['is_duplicated'] = df.duplicated(['Column2', 'Column3', 'Column4'])
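As a small aside on the call above: duplicated also takes a keep argument that controls which member of each duplicate group is left unmarked. The sketch below rebuilds the question's frame and compares the default with keep=False; the variable names keys, later_copies and all_copies are only illustrative.

```python
import pandas as pd

# Same data as in the question.
df = pd.DataFrame({
    'Column1': [1, 2, 3],
    'Column2': ['ABC', 'XYZ', 'ABC'],
    'Column3': ['DEF', 'DEF', 'DEF'],
    'Column4': [10, 40, 10],
})
keys = ['Column2', 'Column3', 'Column4']

# Default (keep='first'): the first occurrence is NOT flagged, later copies are.
later_copies = df.duplicated(keys)

# keep=False: every row that belongs to a duplicated group is flagged.
all_copies = df.duplicated(keys, keep=False)

print(later_copies.tolist())
print(all_copies.tolist())
```

The default is what the question asks for, since the first occurrence should come out as False.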

For the second column (the index of the first occurrence), you could try something like this:

In [13]: g = df.groupby(['Column2', 'Column3', 'Column4'])

In [14]: df1 = df.set_index(['Column2', 'Column3', 'Column4'])

In [15]: df1.index.map(lambda ind: g.indices[ind][0])
Out[15]: array([0, 1, 0])

In [16]: df['dup_index'] = df1.index.map(lambda ind: g.indices[ind][0])

In [17]: df
Out[17]: 
   Column1 Column2 Column3  Column4 is_duplicated  dup_index
0        1     ABC     DEF       10         False          0
1        2     XYZ     DEF       40         False          1
2        3     ABC     DEF       10          True          0
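The same result can probably be reached without the set_index/map step. The following is a minimal sketch, assuming a reasonably recent pandas and no missing values in the key columns (groupby drops NaN keys by default). It groups a Series of the index labels by the key columns and broadcasts the first label of each group back to every row. The names keys and index_labels are illustrative, not part of the answer above.

```python
import pandas as pd

# Same data as in the question.
df = pd.DataFrame({
    'Column1': [1, 2, 3],
    'Column2': ['ABC', 'XYZ', 'ABC'],
    'Column3': ['DEF', 'DEF', 'DEF'],
    'Column4': [10, 40, 10],
})
keys = ['Column2', 'Column3', 'Column4']

# True for every row whose key combination has already appeared.
df['Is_Duplicate'] = df.duplicated(keys)

# A Series whose values are the index labels themselves.
index_labels = pd.Series(df.index, index=df.index)

# For each key combination, take the first index label and
# broadcast it to all rows of that group.
df['Dup_Index'] = index_labels.groupby([df[k] for k in keys]).transform('first')

print(df)
```

Because transform returns a Series aligned with the original index, the row order of df is left untouched.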

answered Oct 07 '22 22:10

Andy Hayden

Let's say your dataframe is stored in df.

You can use groupby to get the non-duplicated rows of your dataframe. Column1 is left out of the grouping keys because it is not part of the data being compared:

df_nodup = df.groupby(by=['Column2', 'Column3', 'Column4']).first()

You can then merge this new dataframe with the original one using the merge function:

df = df.merge(df_nodup, left_on=['Column2', 'Column3', 'Column4'], right_index=True, suffixes=('', '_dupindex'))

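Finally, you can use the merged Column1_dupindex column to derive the columns needed. Note that this relies on Column1 being a unique row identifier, and that Dup_Index ends up holding the Column1 value of the first occurrence (None for non-duplicated rows) rather than its index label: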

df['Is_Duplicate'] = df['Column1'] != df['Column1_dupindex']
df['Dup_Index'] = None
df['Dup_Index'] = df['Dup_Index'].where(df['Column1_dupindex'] == df['Column1'], df['Column1_dupindex'])
del df['Column1_dupindex']
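If the index label is wanted instead of the Column1 value, the merge idea can be adapted. This is a sketch under stated assumptions, not the approach above verbatim: it assumes the index is unnamed (so reset_index produces a column called 'index') and that no existing column already has that name. The names keys, first_rows and out are illustrative.

```python
import pandas as pd

# Same data as in the question.
df = pd.DataFrame({
    'Column1': [1, 2, 3],
    'Column2': ['ABC', 'XYZ', 'ABC'],
    'Column3': ['DEF', 'DEF', 'DEF'],
    'Column4': [10, 40, 10],
})
keys = ['Column2', 'Column3', 'Column4']

# Expose the index labels as an ordinary column named 'index'.
flat = df.reset_index()

# One row per key combination: its first occurrence and that row's label.
first_rows = (flat.drop_duplicates(subset=keys, keep='first')[keys + ['index']]
                  .rename(columns={'index': 'Dup_Index'}))

# A left merge keeps the original row order; then restore the original index.
out = flat.merge(first_rows, on=keys, how='left').set_index('index')
out.index.name = None

# A row is a duplicate when its own label differs from the first occurrence's.
out['Is_Duplicate'] = out['Dup_Index'].to_numpy() != out.index.to_numpy()

print(out)
```

Using how='left' keeps the rows in their original order, which is why the index can be restored directly afterwards.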

answered Oct 07 '22 21:10

Zeugma
