
Pandas, how to filter a df to get unique entries?

I have a dataframe like this:

ID  type value
1   A    8
2   A    5
3   B    11
4   C    12
5   D    1
6   D    22
7   D    13

I want to filter the dataframe so that each value of the "type" attribute occurs only once (e.g. A appears only once). If several rows share the same "type", I want to keep the one with the highest "value". I want to get something like:

ID  type value
1   A    8
3   B    11
4   C    12
6   D    22

How do I do this with pandas?

Gioelelm asked Jan 28 '14 10:01




2 Answers

One way is to sort the dataframe by "type" and by descending "value", and then take the first row of each group after a groupby.

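# first way
# sort so that, within each type, the largest value comes first
# (named sorted_df to avoid shadowing the builtin sorted)
sorted_df = df.sort_values(['type', 'value'], ascending=[True, False])

# keep the first row of each type
first = sorted_df.groupby('type').first().reset_index()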

Another way does not necessarily keep only one row per type: if several rows share the same maximum value, it keeps all of their IDs rather than just one of them.

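# second way
# maximum value per type
grouped = df.groupby('type').agg({'value': 'max'}).reset_index()
grouped = grouped.set_index(['type', 'value'])

# join back on (type, value) to recover every ID that attains the maximum
second = grouped.join(df.set_index(['type', 'value']))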

Example, using the question's data plus an extra row (ID=8) that ties with ID=6 for the maximum of type D:

ID  type    value
1   A   8
2   A   5
3   B   11
4   C   12
5   D   1
6   D   22
7   D   13
8   D   22

The first method results in:

type  ID  value
A   1      8
B   3     11
C   4     12
D   6     22

The second method also keeps ID=8:

            ID
type value    
A    8       1
B    11      3
C    12      4
D    22      6
     22      8

(You can call reset_index() again here if you don't want the MultiIndex.)
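
For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the example data above and applies both ways. The variable names (df, sorted_df, first, second) are only illustrative, and passing the string 'max' instead of the builtin max is my assumption about what recent pandas versions prefer; treat it as a sketch rather than the definitive approach.

```python
import pandas as pd

# Example data from above, including the tie (ID=6 and ID=8 both have value 22).
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6, 7, 8],
    "type": ["A", "A", "B", "C", "D", "D", "D", "D"],
    "value": [8, 5, 11, 12, 1, 22, 13, 22],
})

# First way: sort by type, then by descending value, and keep the first row per type.
sorted_df = df.sort_values(["type", "value"], ascending=[True, False])
first = sorted_df.groupby("type").first().reset_index()

# Second way: compute the maximum per type, then join back on (type, value)
# so that every ID attaining the maximum is kept.
grouped = df.groupby("type").agg({"value": "max"}).reset_index()
grouped = grouped.set_index(["type", "value"])
second = grouped.join(df.set_index(["type", "value"])).reset_index()

print(first)
print(second)
```

The first way returns exactly one row per type, while the second returns two rows for type D because of the tie.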

mkln answered Oct 13 '22 04:10



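df[['type', 'value']].sort_values('value', ascending=False).drop_duplicates(subset=['type'])

drop_duplicates keeps the first row it sees for each 'type', so the frame has to be sorted by 'value' in descending order first; without the sort you get the first occurrence (e.g. 1 for D) rather than the highest value. This works generally: if you have more columns, select the ones you are interested in; in our case we chose 'type' and 'value'.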
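
Note that drop_duplicates defaults to keep='first', so which row survives depends on row order. A minimal sketch of the difference, assuming the question's data (the names unsorted_result and sorted_result are just for illustration):

```python
import pandas as pd

# Data from the question.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6, 7],
    "type": ["A", "A", "B", "C", "D", "D", "D"],
    "value": [8, 5, 11, 12, 1, 22, 13],
})

# Without sorting, the first row seen for each type is kept (D keeps value 1).
unsorted_result = df.drop_duplicates(subset=["type"])

# Sorting by descending value first makes the kept row the maximum (D keeps value 22).
sorted_result = (
    df.sort_values("value", ascending=False)
      .drop_duplicates(subset=["type"])
      .sort_values("ID")
)

print(unsorted_result)
print(sorted_result)
```

Keeping the ID column here, instead of selecting only 'type' and 'value', makes it easy to see which original row was retained.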

vesszabo answered Oct 13 '22 04:10
