Determine if columns of a pandas dataframe uniquely identify the rows

Tags: pandas, dataframe

I'm looking for a way to determine if a column or set of columns of a pandas dataframe uniquely identifies the rows of that dataframe. I've seen this called the isid function in Stata.

The best I can think of is to collect the distinct values of a subset of columns with a set comprehension, then assert that the set has as many elements as the dataframe has rows:

subset = df[["colA", "colC"...]]
unique_vals = {tuple(x) for x in subset.values}
assert len(unique_vals) == len(df)

This isn't the most elegant answer in the world, so I'm wondering whether there's a built-in function that does this, or perhaps a way to test whether a subset of columns forms a uniquely valued index.


asked Jul 24 '18 01:07

Kyle Heuton

3 Answers

You could make an index and check its is_unique attribute:

import pandas as pd

df1 = pd.DataFrame([(1,2),(1,2)], columns=list('AB'))

df2 = pd.DataFrame([(1,2),(1,3)], columns=list('AB'))

print(df1.set_index(['A','B']).index.is_unique)
# False

print(df2.set_index(['A','B']).index.is_unique)
# True
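If this check is needed in several places, it could be wrapped in a small helper. The sketch below is only one way to do it: the name is_id is hypothetical (borrowed from Stata's isid) and is not a pandas built-in. It relies solely on set_index and the index's is_unique attribute, as in the snippet above.

```python
import pandas as pd


def is_id(df, cols):
    """Return True if the columns in `cols` uniquely identify the rows of `df`."""
    # Hypothetical helper, not part of pandas: build an index from the key
    # columns and ask whether every entry in that index is distinct.
    return df.set_index(list(cols)).index.is_unique


df1 = pd.DataFrame([(1, 2), (1, 2)], columns=list('AB'))
df2 = pd.DataFrame([(1, 2), (1, 3)], columns=list('AB'))

print(is_id(df1, ['A', 'B']))  # False: both rows are (1, 2)
print(is_id(df2, ['A', 'B']))  # True: (1, 2) and (1, 3) differ
print(is_id(df2, ['A']))       # False: column A alone is 1 in both rows
```

Passing a single column works the same way, since set_index accepts a one-element list.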


answered Oct 04 '22 06:10

unutbu

Maybe use groupby with size, and check that every group has exactly one row:

df.groupby(['x','y']).size()==1
Out[308]: 
x  y
1  a     True
2  b     True
3  c     True
4  d    False
dtype: bool
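To collapse the per-group result into a single yes/no answer, the comparison can be reduced with all(). This is a rough sketch; the sample frame is an assumption, chosen to be consistent with the output shown above. One caveat worth checking for your data: by default groupby drops rows whose key columns contain missing values, so those rows are not counted.

```python
import pandas as pd

# Assumed sample data, consistent with the output above: (4, 'd') appears twice.
df = pd.DataFrame({'x': [1, 2, 3, 4, 4], 'y': ['a', 'b', 'c', 'd', 'd']})

# Number of rows for each distinct (x, y) combination.
group_sizes = df.groupby(['x', 'y']).size()

# The columns identify the rows only if every combination occurs exactly once.
is_unique = bool((group_sizes == 1).all())
print(is_unique)  # False

# The offending key combinations, with their row counts.
print(group_sizes[group_sizes > 1])
```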

answered Oct 04 '22 05:10

BENY

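You can check

df[['x', 'y']].transform(tuple,1).duplicated(keep=False).any()

to see whether any rows share the same combination of values in columns x and y. The result is True when duplicates exist, so the columns uniquely identify the rows only when it is False.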

Example:

df = pd.DataFrame({'x':[1,2,3,4,4], 'y': ["a", "b", "c", "d","d"]})


    x   y
0   1   a
1   2   b
2   3   c
3   4   d
4   4   d

Then the transform gives

0    (1, a)
1    (2, b)
2    (3, c)
3    (4, d)
4    (4, d)
dtype: object

Then check which rows are duplicated with duplicated(keep=False):

0    False
1    False
2    False
3     True
4     True
dtype: bool

Note that converting the rows to tuples might not be necessary, since duplicated works on the rows of a DataFrame directly:

df.duplicated(keep=False)

0    False
1    False
2    False
3     True
4     True
dtype: bool
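When the frame has more columns than the ones being tested, duplicated also accepts a subset argument, so the key columns do not need to be selected first. A minimal sketch follows; the extra column z and the variable names are assumptions added for illustration.

```python
import pandas as pd

# Same x/y data as in the example, plus an assumed extra column z.
df = pd.DataFrame({'x': [1, 2, 3, 4, 4],
                   'y': ['a', 'b', 'c', 'd', 'd'],
                   'z': [10, 20, 30, 40, 50]})

key = ['x', 'y']

# True if at least one row repeats an earlier (x, y) combination.
has_dupes = df.duplicated(subset=key).any()
print(not has_dupes)  # False: x and y do not uniquely identify the rows

# keep=False marks every member of a duplicated group, which is handy
# for inspecting the rows that collide.
print(df[df.duplicated(subset=key, keep=False)])

# Column z alone is unique in this sample.
print(not df.duplicated(subset=['z']).any())  # True
```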

answered Oct 04 '22 07:10

rafaelc
