Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to "select distinct" across multiple data frame columns in pandas?

People also ask

How do I get unique values from multiple columns in pandas?

To find unique values from multiple columns, use the unique() method. Let's say you have Employee Records with “EmpName” and “Zone” in your Pandas DataFrame. The name and zone can get repeated since two employees can have similar names and a zone can have more than one employee.

How do I get unique values from two data frames?

To get the unique values in multiple columns of a dataframe, we can merge the contents of those columns to create a single series object and then can call unique() function on that series object i.e. It returns the count of unique elements in multiple columns.

How do I get unique columns in pandas?

Unique is also referred to as distinct, you can get unique values in the column using pandas Series. unique() function, since this function needs to call on the Series object, use df['column_name'] to get the unique values as a Series.


You can use the drop_duplicates method to get the unique rows in a DataFrame:

In [29]: df = pd.DataFrame({'a':[1,2,1,2], 'b':[3,4,3,5]})

In [30]: df
Out[30]:
   a  b
0  1  3
1  2  4
2  1  3
3  2  5

In [32]: df.drop_duplicates()
Out[32]:
   a  b
0  1  3
1  2  4
3  2  5

You can also provide the subset keyword argument if you only want to use certain columns to determine uniqueness. See the docstring.


I've tried different solutions. First was:

a_df=np.unique(df[['col1','col2']], axis=0)

and it works well for not object data Another way to do this and to avoid error (for object columns type) is to apply drop_duplicates()

a_df=df.drop_duplicates(['col1','col2'])[['col1','col2']]

You can also use SQL to do this, but it worked very slow in my case:

from pandasql import sqldf
q="""SELECT DISTINCT col1, col2 FROM df;"""
pysqldf = lambda q: sqldf(q, globals())
a_df = pysqldf(q)

To solve a similar problem, I'm using groupby:

print(f"Distinct entries: {len(df.groupby(['col1', 'col2']))}")

Whether that's appropriate will depend on what you want to do with the result, though (in my case, I just wanted the equivalent of COUNT DISTINCT as shown).


There is no unique method for a df, if the number of unique values for each column were the same then the following would work: df.apply(pd.Series.unique) but if not then you will get an error. Another approach would be to store the values in a dict which is keyed on the column name:

In [111]:
df = pd.DataFrame({'a':[0,1,2,2,4], 'b':[1,1,1,2,2]})
d={}
for col in df:
    d[col] = df[col].unique()
d

Out[111]:
{'a': array([0, 1, 2, 4], dtype=int64), 'b': array([1, 2], dtype=int64)}

I think use drop duplicate sometimes will not so useful depending dataframe.

I found this:

[in] df['col_1'].unique()
[out] array(['A', 'B', 'C'], dtype=object)

And work for me!

https://riptutorial.com/pandas/example/26077/select-distinct-rows-across-dataframe