
Selecting unique observations in a pandas data frame

Tags: python, pandas

I have a pandas data frame with a column uniqueid. I would like to remove all duplicates from the data frame based on this column, such that all remaining observations are unique.

Michael asked Oct 31 '13 23:10



2 Answers

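There is also the drop_duplicates() method, available on any data frame (see the pandas documentation). Pass the column or columns to check for duplicates via the subset argument; by default the first occurrence of each value is kept.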

df.drop_duplicates(subset='uniqueid', inplace=True)
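If you would rather not modify the frame in place, a minimal sketch along these lines may help. The sample data and the `value` column are made up for illustration; only `uniqueid` comes from the question. It shows that drop_duplicates returns a new frame when inplace is left off, and that the keep argument selects which of the duplicated rows survives.

```python
import pandas as pd

# Hypothetical sample data; 'uniqueid' is the column named in the question,
# 'value' is an invented payload column.
df = pd.DataFrame({'uniqueid': [1, 2, 2, 3], 'value': ['a', 'b', 'c', 'd']})

# Without inplace=True the original frame is untouched and a new one is returned.
# keep='first' is the default: the first row seen for each uniqueid survives.
first = df.drop_duplicates(subset='uniqueid')

# keep='last' retains the last row seen for each uniqueid instead.
last = df.drop_duplicates(subset='uniqueid', keep='last')

print(first)
print(last)
```

Which row to keep is a data question rather than a pandas one; if the rows are ordered by time, for instance, keep='last' would retain the most recent record per id.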
cwharland answered Sep 20 '22 21:09


Use the duplicated method

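Since we only care whether uniqueid (A in my example) is duplicated, select that column and call duplicated on the resulting Series. This marks every repeat of a value after its first occurrence as True. Then use ~ to invert the boolean mask, so that indexing with it keeps only the first row for each value.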

In [90]: df = pd.DataFrame({'A': ['a', 'b', 'b', 'c'], 'B': [1, 2, 3, 4]})

In [91]: df
Out[91]: 
   A  B
0  a  1
1  b  2
2  b  3
3  c  4

In [92]: df['A'].duplicated()
Out[92]: 
0    False
1    False
2     True
3    False
Name: A, dtype: bool

In [93]: df.loc[~df['A'].duplicated()]
Out[93]: 
   A  B
0  a  1
1  b  2
3  c  4
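As a rough check, the mask approach above can be written as a script and compared against drop_duplicates. This is only a sketch using the same example frame; the variable names are illustrative. It also shows keep=False, which marks every occurrence of a repeated value, so the inverted mask drops all rows whose A value is not unique.

```python
import pandas as pd

# Same example frame as in the session above.
df = pd.DataFrame({'A': ['a', 'b', 'b', 'c'], 'B': [1, 2, 3, 4]})

# True for each repeat of a value after its first occurrence.
mask = df['A'].duplicated()

# Inverting the mask keeps the first row for each value of A.
via_mask = df.loc[~mask]

# The same selection expressed with drop_duplicates.
via_drop = df.drop_duplicates(subset='A')
same = via_mask.equals(via_drop)

# keep=False flags every occurrence of a repeated value,
# so only rows whose A value appears exactly once remain.
strictly_unique = df.loc[~df['A'].duplicated(keep=False)]

print(same)
print(strictly_unique)
```

The mask form is handy when the duplicate test needs to be combined with other boolean conditions before indexing.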
TomAugspurger answered Sep 21 '22 21:09