How to "select distinct" across multiple data frame columns in pandas?

People also ask

How do I get unique values from multiple columns in pandas?

To find unique values across multiple columns, call unique() on each column, or on the columns' combined values. Say your pandas DataFrame holds employee records with "EmpName" and "Zone" columns. Both names and zones can repeat, since two employees can share a name and a zone can have more than one employee.

How do I get unique values from two data frames?

To get the unique values across multiple columns of a DataFrame, merge the contents of those columns into a single Series and call unique() on it. This returns the unique elements themselves; use nunique() on the Series if you want their count instead.
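As a rough sketch of merging columns into a single Series, the example below stacks two columns with pd.concat and then calls unique() and nunique(). The column names col1 and col2 and the sample values are made up for illustration.

```python
import pandas as pd

# Hypothetical data: two columns that share some values.
df = pd.DataFrame({
    "col1": ["A", "B", "A", "C"],
    "col2": ["B", "D", "B", "A"],
})

# Stack the two columns into one long Series (col1 values first, then col2).
combined = pd.concat([df["col1"], df["col2"]], ignore_index=True)

unique_values = combined.unique()   # distinct values, in order of first appearance
unique_count = combined.nunique()   # number of distinct values

print(unique_values)
print(unique_count)
```

Note that this treats both columns as one pool of values, which is different from finding distinct (col1, col2) pairs.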

How do I get unique columns in pandas?

Unique values are also called distinct values. You can get the unique values of a column with the pandas Series.unique() function. Because this function must be called on a Series, select the column first with df['column_name'] and then call unique() on it; the result is an array of the unique values.

You can use the drop_duplicates method to get the unique rows in a DataFrame:

In [28]: import pandas as pd

In [29]: df = pd.DataFrame({'a':[1,2,1,2], 'b':[3,4,3,5]})

In [30]: df
Out[30]:
   a  b
0  1  3
1  2  4
2  1  3
3  2  5

In [32]: df.drop_duplicates()
Out[32]:
   a  b
0  1  3
1  2  4
3  2  5

You can also provide the subset keyword argument if you only want to use certain columns to determine uniqueness. See the docstring.
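For instance, a minimal sketch of the subset argument, reusing the same small frame as above. The keep="last" variant is included only to show how the retained row can be changed.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 1, 2], "b": [3, 4, 3, 5]})

# Only column 'a' decides what counts as a duplicate. By default
# (keep="first") the first row for each distinct 'a' value is kept.
by_a = df.drop_duplicates(subset=["a"])

# keep="last" retains the last occurrence of each 'a' value instead.
by_a_last = df.drop_duplicates(subset=["a"], keep="last")

print(by_a)
print(by_a_last)
```

The other columns come along with whichever row is kept, so the values of 'b' in the result depend on the keep setting.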

I've tried different solutions. First was:

a_df=np.unique(df[['col1','col2']], axis=0)

This works well for non-object data, although it returns a sorted NumPy array rather than a DataFrame. Another way, which avoids the error raised for object-dtype columns, is to apply drop_duplicates():

a_df=df.drop_duplicates(['col1','col2'])[['col1','col2']]
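Here is a small sketch of the drop_duplicates() approach on string (object) columns, with made-up data and an extra col3 that should not affect the result. It selects the columns first and resets the index, which is one way to get output shaped like SQL's SELECT DISTINCT col1, col2.

```python
import pandas as pd

# Hypothetical data with object-dtype (string) columns.
df = pd.DataFrame({
    "col1": ["x", "y", "x", "y"],
    "col2": ["p", "q", "p", "r"],
    "col3": [1, 2, 3, 4],
})

# Keep only the columns of interest, drop repeated rows,
# and renumber the index from 0.
distinct = df[["col1", "col2"]].drop_duplicates().reset_index(drop=True)

print(distinct)
```

Selecting the columns before or after drop_duplicates() gives the same rows here; selecting first just avoids carrying the unused columns along.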

You can also use SQL to do this, but it was very slow in my case:

from pandasql import sqldf
q="""SELECT DISTINCT col1, col2 FROM df;"""
pysqldf = lambda q: sqldf(q, globals())
a_df = pysqldf(q)

To solve a similar problem, I'm using groupby:

print(f"Distinct entries: {len(df.groupby(['col1', 'col2']))}")

Whether that's appropriate will depend on what you want to do with the result, though (in my case, I just wanted the equivalent of COUNT DISTINCT as shown).
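To illustrate the COUNT DISTINCT idea, here is a sketch comparing the groupby count with two alternatives on made-up data. All three agree here because the key columns contain no missing values.

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 1, 2], "col2": [3, 4, 3, 5]})

# Number of distinct (col1, col2) pairs, three ways.
n_groupby = len(df.groupby(["col1", "col2"]))
n_ngroups = df.groupby(["col1", "col2"]).ngroups
n_dedup = len(df[["col1", "col2"]].drop_duplicates())

print(f"Distinct entries: {n_groupby}, {n_ngroups}, {n_dedup}")
```

One caveat: by default groupby excludes rows whose key is NaN, while drop_duplicates keeps them, so the counts can differ when the key columns contain missing values.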

There is no unique method for a DataFrame. If the number of unique values were the same for every column, then df.apply(pd.Series.unique) would work, but otherwise it may raise an error. Another approach is to store the values in a dict keyed on the column name:

In [111]:
df = pd.DataFrame({'a':[0,1,2,2,4], 'b':[1,1,1,2,2]})
d={}
for col in df:
    d[col] = df[col].unique()
d

Out[111]:
{'a': array([0, 1, 2, 4], dtype=int64), 'b': array([1, 2], dtype=int64)}
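The same loop can be written as a dict comprehension, sketched below on the same data. DataFrame.nunique() is shown alongside it as an option when only the per-column counts are needed.

```python
import pandas as pd

df = pd.DataFrame({"a": [0, 1, 2, 2, 4], "b": [1, 1, 1, 2, 2]})

# One array of unique values per column, keyed by column name.
uniques = {col: df[col].unique() for col in df}

# Number of unique values per column, returned as a Series.
counts = df.nunique()

print(uniques)
print(counts)
```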

I think drop_duplicates is sometimes not that useful, depending on the DataFrame.

I found this:

[in] df['col_1'].unique()
[out] array(['A', 'B', 'C'], dtype=object)

And it worked for me!

https://riptutorial.com/pandas/example/26077/select-distinct-rows-across-dataframe

Tags: python, pandas, dataframe, duplicates, distinct
