I have a dataframe like this:
Date PlumeO Distance
2014-08-13 13:48:00 754.447905 5.844577
2014-08-13 13:48:00 754.447905 6.888653
2014-08-13 13:48:00 754.447905 6.938860
2014-08-13 13:48:00 754.447905 6.977284
2014-08-13 13:48:00 754.447905 6.946430
2014-08-13 13:48:00 754.447905 6.345506
2014-08-13 13:48:00 754.447905 6.133567
2014-08-13 13:48:00 754.447905 5.846046
2014-08-13 16:59:00 754.447905 6.345506
2014-08-13 16:59:00 754.447905 6.694847
2014-08-13 16:59:00 754.447905 5.846046
2014-08-13 16:59:00 754.447905 6.977284
2014-08-13 16:59:00 754.447905 6.938860
2014-08-13 16:59:00 754.447905 5.844577
2014-08-13 16:59:00 754.447905 6.888653
2014-08-13 16:59:00 754.447905 6.133567
2014-08-13 16:59:00 754.447905 6.946430
For each date, I'm trying to keep only the row with the smallest distance, i.e. drop the duplicate dates and keep the one with the smallest distance.
Is there a way to achieve this with pandas' df.drop_duplicates,
or am I stuck using if statements to find the smallest distance?
For reference, the signature is drop_duplicates(self, subset=None, keep="first", inplace=False). subset: takes a column label or a list of column labels used to identify duplicate rows. By default, all columns are used to find duplicate rows. keep: allowed values are {'first', 'last', False}, default 'first'.
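A small sketch of how the keep argument behaves, using a hypothetical toy dataframe (the names toy, key, and val are made up for illustration and are not from the question):

```python
import pandas as pd

# Hypothetical toy data, not the asker's dataframe.
toy = pd.DataFrame({"key": ["a", "a", "b"], "val": [1, 2, 3]})

# keep="first": for each duplicated key, keep the first row encountered.
first = toy.drop_duplicates(subset="key", keep="first")

# keep="last": for each duplicated key, keep the last row encountered.
last = toy.drop_duplicates(subset="key", keep="last")

# keep=False: drop every row whose key appears more than once.
none = toy.drop_duplicates(subset="key", keep=False)

print(first.index.tolist(), last.index.tolist(), none.index.tolist())
```

Because keep only looks at row order, which row survives depends on how the dataframe is sorted before the call.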
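Sort by distance, then drop duplicate dates. With keep='first', the row that survives for each date is the one with the smallest distance: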
df.sort_values('Distance').drop_duplicates(subset='Date', keep='first')
Out:
Date PlumeO Distance
0 2014-08-13 13:48:00 754.447905 5.844577
13 2014-08-13 16:59:00 754.447905 5.844577
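For anyone who wants to reproduce this, here is a self-contained sketch that rebuilds the dataframe from the question and runs the one-liner. It also shows an alternative that skips the sort by using groupby with idxmin; that second approach is my own addition rather than part of the answer above, and the variable names by_sort and by_idxmin are just illustrative.

```python
import pandas as pd

# Rebuild the dataframe from the question (8 rows at 13:48, 9 rows at 16:59).
dates = ["2014-08-13 13:48:00"] * 8 + ["2014-08-13 16:59:00"] * 9
distances = [
    5.844577, 6.888653, 6.938860, 6.977284, 6.946430, 6.345506, 6.133567, 5.846046,
    6.345506, 6.694847, 5.846046, 6.977284, 6.938860, 5.844577, 6.888653, 6.133567,
    6.946430,
]
df = pd.DataFrame({
    "Date": pd.to_datetime(dates),
    "PlumeO": 754.447905,
    "Distance": distances,
})

# Approach from the answer: sort so the smallest distance comes first,
# then keep the first row seen for each date.
by_sort = (
    df.sort_values("Distance")
      .drop_duplicates(subset="Date", keep="first")
      .sort_index()  # only to get a predictable row order for display
)

# Alternative sketch: look up the index label of the minimum distance per date.
by_idxmin = df.loc[df.groupby("Date")["Distance"].idxmin()]

print(by_sort)
print(by_idxmin)
```

Both variants select the rows at index 0 and 13 here. Note that if several rows tie for the smallest distance within a date, each approach still keeps just one of them.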