pandas unique values multiple columns

People also ask

How do I get unique values in multiple columns in pandas?

Pandas series aka columns has a unique() method that filters out only unique values from a column. The first output shows only unique FirstNames. We can extend this method using pandas concat() method and concat all the desired columns into 1 single column and then find the unique of the resultant column.

How do I get unique values from multiple columns in Python?

To get the unique values in multiple columns of a dataframe, we can merge the contents of those columns to create a single series object and then can call unique() function on that series object i.e. It returns the count of unique elements in multiple columns.

pd.unique returns the unique values from an input array, or DataFrame column or index.

The input to this function needs to be one-dimensional, so multiple columns will need to be combined. The simplest way is to select the columns you want and then view the values in a flattened NumPy array. The whole operation looks like this:

>>> pd.unique(df[['Col1', 'Col2']].values.ravel('K'))
array(['Bob', 'Joe', 'Bill', 'Mary', 'Steve'], dtype=object)

Note that ravel() is an array method that returns a view (if possible) of a multidimensional array. The argument 'K' tells the method to flatten the array in the order the elements are stored in the memory (pandas typically stores underlying arrays in Fortran-contiguous order; columns before rows). This can be significantly faster than using the method's default 'C' order.

An alternative way is to select the columns and pass them to np.unique:

>>> np.unique(df[['Col1', 'Col2']].values)
array(['Bill', 'Bob', 'Joe', 'Mary', 'Steve'], dtype=object)

There is no need to use ravel() here as the method handles multidimensional arrays. Even so, this is likely to be slower than pd.unique as it uses a sort-based algorithm rather than a hashtable to identify unique values.

The difference in speed is significant for larger DataFrames (especially if there are only a handful of unique values):

>>> df1 = pd.concat([df]*100000, ignore_index=True) # DataFrame with 500000 rows
>>> %timeit np.unique(df1[['Col1', 'Col2']].values)
1 loop, best of 3: 1.12 s per loop

>>> %timeit pd.unique(df1[['Col1', 'Col2']].values.ravel('K'))
10 loops, best of 3: 38.9 ms per loop

>>> %timeit pd.unique(df1[['Col1', 'Col2']].values.ravel()) # ravel using C order
10 loops, best of 3: 49.9 ms per loop

I have setup a DataFrame with a few simple strings in it's columns:

>>> df
   a  b
0  a  g
1  b  h
2  d  a
3  e  e

You can concatenate the columns you are interested in and call unique function:

>>> pandas.concat([df['a'], df['b']]).unique()
array(['a', 'b', 'd', 'e', 'g', 'h'], dtype=object)

In [5]: set(df.Col1).union(set(df.Col2))
Out[5]: {'Bill', 'Bob', 'Joe', 'Mary', 'Steve'}

Or:

set(df.Col1) | set(df.Col2)

An updated solution using numpy v1.13+ requires specifying the axis in np.unique if using multiple columns, otherwise the array is implicitly flattened.

import numpy as np

np.unique(df[['col1', 'col2']], axis=0)

This change was introduced Nov 2016: https://github.com/numpy/numpy/commit/1f764dbff7c496d6636dc0430f083ada9ff4e4be

Non-pandas solution: using set().

import pandas as pd
import numpy as np

df = pd.DataFrame({'Col1' : ['Bob', 'Joe', 'Bill', 'Mary', 'Joe'],
              'Col2' : ['Joe', 'Steve', 'Bob', 'Bob', 'Steve'],
               'Col3' : np.random.random(5)})

print df

print set(df.Col1.append(df.Col2).values)

Output:

   Col1   Col2      Col3
0   Bob    Joe  0.201079
1   Joe  Steve  0.703279
2  Bill    Bob  0.722724
3  Mary    Bob  0.093912
4   Joe  Steve  0.766027
set(['Steve', 'Bob', 'Bill', 'Joe', 'Mary'])

Related questions
                            
                                How to read a .xlsx file using the pandas Library in iPython?
                            
                                How do I correctly setup and teardown for my pytest class with tests?
                            
                                What is Truthy and Falsy? How is it different from True and False?
                            
                                Showing line numbers in IPython/Jupyter Notebooks
                            
                                Flask raises TemplateNotFound error even though template file exists
                            
                                How can I see function arguments in IPython Notebook Server 3?
                            
                                How to break a line of chained methods in Python?
                            
                                Flask SQLAlchemy query, specify column names
                            
                                "ssl module in Python is not available" when installing package with pip3
                            
                                Fitting empirical distribution to theoretical ones with Scipy (Python)?
                            
                                error installing psycopg2, library not found for -lssl
                            
                                decorators in the python standard lib (@deprecated specifically)
                            
                                A tool to convert MATLAB code to Python [closed]
                            
                                Is explicitly closing files important?
                            
                                Can't import my own modules in Python
                            
                                What is choice_set in this Django app tutorial?
                            
                                Python Mocking a function from an imported module
                            
                                How to keep index when using pandas merge
                            
                                How to do parallel programming in Python?
                            
                                Is nested function a good approach when required by only one function? [closed]

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

pandas unique values multiple columns

Tags:

python

pandas

dataframe

unique

People also ask

Recent Activity

Donate For Us