Drop duplicates keeping the row with the highest value in another column

Tags:

python

pandas

import pandas as pd

# One (Name, Count) pair per row, so the two column names line up
a = [['John', 10], ['Mary', 22], ['John', 50]]
df1 = pd.DataFrame(a, columns=['Name', 'Count'])

Given a DataFrame like this, I want to compare the "Count" values of all rows that share the same "Name" and keep only the highest one. I'm not sure how to do this with a DataFrame in Python.

For example, in the case above the answer would be:

Name Count
Mary 22
John 50

The lower value John 10 has been dropped (I only want to see the highest value of "Count" based on the same value for "Name").

In SQL it would be something like a Select Case query (wherein I select the Case where Name == Name and Count > Count, recursively, to determine the highest number), or a for loop over each name. But as I understand it, looping over a DataFrame is a bad idea due to the nature of the object.

Is there a way to do this with a DataFrame in Python? I could create a new DataFrame for each name (for example, one with only John) and then get the highest value (df.value()[:1] or similar), but as I have many hundreds of unique entries, that seems like a terrible solution. :D

asked Jul 21 '18 20:07

Kafka

1 Answer

Either sort_values and drop_duplicates (the ascending sort puts each name's highest Count last, so keep='last' retains it),

df1.sort_values('Count').drop_duplicates('Name', keep='last')

   Name  Count
1  Mary     22
2  John     50
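
As a self-contained sketch of the step above, the snippet below rebuilds the example frame and applies the same sort-then-deduplicate call. It assumes the data is meant to hold one (Name, Count) pair per row; the variable name result is only illustrative.

```python
import pandas as pd

# Assumed layout: one (Name, Count) pair per row.
df1 = pd.DataFrame(
    [['John', 10], ['Mary', 22], ['John', 50]],
    columns=['Name', 'Count'],
)

# Ascending sort puts each name's largest Count last,
# so keep='last' retains the row holding the maximum.
result = df1.sort_values('Count').drop_duplicates('Name', keep='last')
print(result)
```

Because whole rows are kept, the original index labels (1 and 2 here) and any other columns in the frame carry over unchanged.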

Or, like miradulo said, groupby and max.

df1.groupby('Name')['Count'].max().reset_index()

   Name  Count
0  John     50
1  Mary     22
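
Note that the groupby and max version returns only the grouping column and the aggregated column. If the frame has further columns that should survive, one possible variant (not part of the original answer) is to look up the full rows with idxmax. The City column and the names df and rows below are hypothetical, added only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Mary', 'John'],
    'Count': [10, 22, 50],
    'City': ['Oslo', 'Lima', 'Rome'],  # hypothetical extra column
})

# idxmax gives, per name, the index label of the row with the largest Count;
# .loc then pulls those complete rows, extra columns included.
rows = df.loc[df.groupby('Name')['Count'].idxmax()]
print(rows)
```

If several rows tie for the maximum within a name, idxmax picks the first one it encounters, so exactly one row per name comes back.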

answered Sep 28 '22 20:09

cs95
