
Determine if columns of a pandas dataframe uniquely identify the rows

I'm looking for a way to determine if a column or set of columns of a pandas dataframe uniquely identifies the rows of that dataframe. I've seen this called the isid function in Stata.

The best I can think of is to collect the unique values of a subset of columns with a set comprehension and assert that the set has as many elements as the dataframe has rows:

subset = df[["colA", "colC"...]]
unique_vals = {tuple(x) for x in subset.values}
assert len(unique_vals) == len(df)

This isn't the most elegant answer in the world, so I'm wondering if there's a built-in function that does this, or perhaps a way to test whether a subset of columns would form a uniquely valued index.

asked Jul 24 '18 01:07 by Kyle Heuton



3 Answers

You could make an index and check its is_unique attribute:

import pandas as pd

df1 = pd.DataFrame([(1,2),(1,2)], columns=list('AB'))

df2 = pd.DataFrame([(1,2),(1,3)], columns=list('AB'))

print(df1.set_index(['A','B']).index.is_unique)
# False

print(df2.set_index(['A','B']).index.is_unique)
# True
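If you need this check in several places, it can be wrapped in a small helper. The following is only a sketch: the name `is_id` is made up here (borrowed from Stata's isid), and it simply reuses the `set_index(...).index.is_unique` idea shown above.

```python
import pandas as pd


def is_id(df: pd.DataFrame, cols: list) -> bool:
    """Return True if the values in `cols` uniquely identify every row of `df`.

    `is_id` is a hypothetical helper name, not a pandas function.
    """
    # set_index builds an Index (or MultiIndex) from the key columns;
    # is_unique is False as soon as any key combination repeats.
    return bool(df.set_index(cols).index.is_unique)


df1 = pd.DataFrame([(1, 2), (1, 2)], columns=list("AB"))
df2 = pd.DataFrame([(1, 2), (1, 3)], columns=list("AB"))

print(is_id(df1, ["A", "B"]))  # False
print(is_id(df2, ["A", "B"]))  # True
print(is_id(df2, ["A"]))       # False: column A alone repeats the value 1
```

Note that `set_index` builds a new frame, so on a very large dataframe one of the `duplicated`-based checks may be cheaper.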
answered Oct 04 '22 06:10 by unutbu


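Maybe use groupby with size, which counts the rows for each combination of key values; a combination is unique when its count is 1: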

df.groupby(['x','y']).size()==1
Out[308]: 
x  y
1  a     True
2  b     True
3  c     True
4  d    False
dtype: bool
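The comparison above gives one boolean per group. To get a single yes/no answer you can reduce it with `.all()`, and the same sizes tell you which key combinations repeat. A minimal sketch, assuming a small example frame with columns `x` and `y` (the variable names `sizes`, `is_unique` and `offending` are just illustrative):

```python
import pandas as pd

# Example data assumed for illustration; the pair (4, "d") appears twice.
df = pd.DataFrame({"x": [1, 2, 3, 4, 4], "y": ["a", "b", "c", "d", "d"]})

# Number of rows for each (x, y) combination.
sizes = df.groupby(["x", "y"]).size()

# True only if every combination occurs exactly once.
is_unique = bool((sizes == 1).all())

# The combinations that occur more than once, with their counts.
offending = sizes[sizes > 1]

print(is_unique)
print(offending)
```

One thing to keep in mind: by default groupby drops rows whose key contains NaN, so rows with missing keys are not counted by this check.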
answered Oct 04 '22 05:10 by BENY


You can check

df[['x', 'y']].transform(tuple,1).duplicated(keep=False).any()

to see whether any rows share the same combination of values in columns x and y.

Example:

df = pd.DataFrame({'x':[1,2,3,4,4], 'y': ["a", "b", "c", "d","d"]})


    x   y
0   1   a
1   2   b
2   3   c
3   4   d
4   4   d

Then transform gives

0    (1, a)
1    (2, b)
2    (3, c)
3    (4, d)
4    (4, d)
dtype: object

Then check which rows are duplicated()

0    False
1    False
2    False
3     True
4     True
dtype: bool

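Note that converting to tuples might not be necessary, because duplicated() also works directly on a DataFrame (pass subset=['x', 'y'] to restrict the check to those columns when the frame has others):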

df.duplicated(keep=False)

0    False
1    False
2    False
3     True
4     True
dtype: bool
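Putting that together for a frame that has more columns than the key, here is a rough sketch. The extra column `z` and the variable names are assumptions added for illustration; `subset` and `keep` are regular parameters of `DataFrame.duplicated`.

```python
import pandas as pd

# Same x/y data as above, plus an extra column z that is not part of the key.
df = pd.DataFrame(
    {
        "x": [1, 2, 3, 4, 4],
        "y": ["a", "b", "c", "d", "d"],
        "z": [10, 20, 30, 40, 50],
    }
)
key = ["x", "y"]

# keep=False marks every row whose key combination occurs more than once.
dup_mask = df.duplicated(subset=key, keep=False)

# The key identifies the rows only if no row is marked as a duplicate.
identifies_rows = not dup_mask.any()

# The rows that clash, handy for debugging.
clashes = df[dup_mask]

print(identifies_rows)
print(clashes)
```

This avoids building an index or tuples, and `clashes` shows exactly which rows break uniqueness.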
answered Oct 04 '22 07:10 by rafaelc