Pandas: remove SOME duplicate values based on conditions

Tags:

python

pandas

duplicates

I have a dataset:

id    url     keep_if_dup
1     A.com   Yes
2     A.com   Yes
3     B.com   No
4     B.com   No
5     C.com   No

I want to remove duplicates, i.e. keep only the first occurrence of each "url" value, BUT keep the duplicates when the "keep_if_dup" field is "Yes".

Expected output:

id    url     keep_if_dup
1     A.com   Yes
2     A.com   Yes
3     B.com   No
5     C.com   No

What I tried:

Dataframe=Dataframe.drop_duplicates(subset='url', keep='first')

which of course does not take the "keep_if_dup" field into account. The output is:

id    url     keep_if_dup
1     A.com   Yes
3     B.com   No
5     C.com   No


asked Jul 26 '16 07:07

Vincent

1 Answer

You can pass a combined boolean condition to loc. The first condition keeps every row where 'keep_if_dup' == 'Yes'; it is ORed (using |) with the inverted boolean mask of whether the 'url' value duplicates an earlier row:

In [79]:
df.loc[(df['keep_if_dup'] =='Yes') | ~df['url'].duplicated()]

Out[79]:
   id    url keep_if_dup
0   1  A.com         Yes
1   2  A.com         Yes
2   3  B.com          No
4   5  C.com          No

To overwrite your df, assign the result back:

df = df.loc[(df['keep_if_dup'] =='Yes') | ~df['url'].duplicated()]
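
For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the sample data from the question and applies the same filter. Constructing the frame inline is an assumption about how the data is loaded, and the names keep_flag, first_seen and result are illustrative only.

```python
import pandas as pd

# Sample data copied from the question; building it inline is an assumption.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "url": ["A.com", "A.com", "B.com", "B.com", "C.com"],
    "keep_if_dup": ["Yes", "Yes", "No", "No", "No"],
})

# Rows flagged to survive even when their url repeats.
keep_flag = df["keep_if_dup"] == "Yes"
# First occurrence of each url: duplicated() marks later repeats as True,
# so inverting it marks the first occurrences.
first_seen = ~df["url"].duplicated()

# Keep a row if either condition holds.
result = df.loc[keep_flag | first_seen]
print(result)
```

The filtered frame keeps the original index labels, so there is a gap where a row was dropped. If a fresh 0..n-1 index is wanted, result.reset_index(drop=True) renumbers the rows.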

Breaking the expression down shows the two boolean masks:

In [80]:
~df['url'].duplicated()

Out[80]:
0     True
1    False
2     True
3    False
4     True
Name: url, dtype: bool

In [81]:
df['keep_if_dup'] =='Yes'

Out[81]:
0     True
1     True
2    False
3    False
4    False
Name: keep_if_dup, dtype: bool
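
To see how the two masks combine row by row, the sketch below puts them side by side with their elementwise OR. The sample frame is rebuilt by hand, and the column names flagged_yes, first_url and keep_row are hypothetical labels chosen for this illustration.

```python
import pandas as pd

# Sample data copied from the question; building it inline is an assumption.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "url": ["A.com", "A.com", "B.com", "B.com", "C.com"],
    "keep_if_dup": ["Yes", "Yes", "No", "No", "No"],
})

# Collect both masks in one frame so they can be compared per row.
masks = pd.DataFrame({
    "flagged_yes": df["keep_if_dup"] == "Yes",
    "first_url": ~df["url"].duplicated(),
})
# Elementwise OR: a row is kept when at least one mask is True.
masks["keep_row"] = masks["flagged_yes"] | masks["first_url"]
print(masks)
```

Only the row at index 3 (the second B.com, flagged No) is False in both masks, so it is the only row dropped.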

answered Oct 01 '22 07:10

EdChum
