I'm looking for a way to determine whether a column or set of columns of a pandas DataFrame uniquely identifies the rows of that DataFrame. I've seen this called the isid function in Stata.
The best I can think of is to collect the unique values of a subset of columns with a set comprehension, and then assert that the set has as many values as the dataframe has rows:
subset = df[["colA", "colC"...]]
unique_vals = {tuple(x) for x in subset.values}
assert(len(unique_vals) == len(df))
This isn't the most elegant answer in the world, so I'm wondering if there's a built-in function that does this, or perhaps a way to test whether a subset of columns forms a uniquely-valued index.
To get the distinct rows of a DataFrame, use drop_duplicates(), which removes duplicate rows and returns only the unique ones.
To check whether the index has unique values, use index.is_unique.
You can get the unique values of a single column with Series.unique() or pandas.unique(). Both operate on one-dimensional data; for distinct combinations across multiple columns, use drop_duplicates() on those columns.
You could make an index and check its is_unique attribute:
import pandas as pd
df1 = pd.DataFrame([(1,2),(1,2)], columns=list('AB'))
df2 = pd.DataFrame([(1,2),(1,3)], columns=list('AB'))
print(df1.set_index(['A','B']).index.is_unique)
# False
print(df2.set_index(['A','B']).index.is_unique)
# True
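The same check can be wrapped in a small helper that mimics what Stata's isid reports. This is only a sketch: the name is_id is made up for illustration and is not a pandas function.

```python
import pandas as pd

def is_id(df, cols):
    """Return True if the columns in `cols` uniquely identify each row of `df`.

    `is_id` is a hypothetical helper name, not part of pandas.
    """
    # Build a (Multi)Index from the chosen columns and ask whether it is unique.
    return df.set_index(list(cols)).index.is_unique

df1 = pd.DataFrame([(1, 2), (1, 2)], columns=list("AB"))
df2 = pd.DataFrame([(1, 2), (1, 3)], columns=list("AB"))

print(is_id(df1, ["A", "B"]))  # both rows are (1, 2), so not an identifier
print(is_id(df2, ["A", "B"]))  # (1, 2) and (1, 3) are distinct
print(is_id(df2, ["A"]))       # column A alone repeats the value 1
```

Passing a list of column names keeps the call close to how isid is used, and the helper works for a single column as well as for several.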
Maybe groupby with size:
df.groupby(['x','y']).size()==1
Out[308]:
x y
1 a True
2 b True
3 c True
4 d False
dtype: bool
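The comparison above gives one boolean per key combination. To get a single yes/no answer, and to see which combinations repeat, something like the following should work. It is a sketch on a small example frame with columns x and y; the variable names are illustrative. Note that groupby drops rows with missing key values by default, so this does not account for NaN keys.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4, 4], "y": ["a", "b", "c", "d", "d"]})

# Number of rows for each (x, y) combination.
sizes = df.groupby(["x", "y"]).size()

# True only if every combination occurs exactly once.
is_unique_key = bool((sizes == 1).all())

# The combinations that occur more than once, with their counts.
repeated = sizes[sizes > 1]

print(is_unique_key)
print(repeated)
```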
You can check
df[['x', 'y']].transform(tuple,1).duplicated(keep=False).any()
to see whether any rows share the same combination of values in columns x and y.
Example:
df = pd.DataFrame({'x':[1,2,3,4,4], 'y': ["a", "b", "c", "d","d"]})
x y
0 1 a
1 2 b
2 3 c
3 4 d
4 4 d
Then transform each row into a tuple:
0 (1, a)
1 (2, b)
2 (3, c)
3 (4, d)
4 (4, d)
dtype: object
Then check which rows are duplicated with duplicated(keep=False):
0 False
1 False
2 False
3 True
4 True
dtype: bool
Notice that transforming into tuples might not be necessary:
df.duplicated(keep=False)
0 False
1 False
2 False
3 True
4 True
dtype: bool