Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Easy Way to See if Two Columns are One-to-One in Pandas

Working with data in Python 3+ with pandas. It seems like there should be an easy way to check if two columns have a one-to-one relationship (regardless of column type), but I'm struggling to think of the best way to do this.

Example of expected output:

A    B     C
0    'a'   'apple'
1    'b'   'banana'
2    'c'   'apple'

A & B are one-to-one? TRUE

A & C are one-to-one? FALSE

B & C are one-to-one? FALSE

like image 667
user1895076 Avatar asked Jun 01 '18 12:06

user1895076


3 Answers

df.groupby(col1)[col2]\
  .apply(lambda x: x.nunique() == 1)\
  .all()

should work fine if you want a true or false answer.

A nice way to visualize the relationship between two columns with discrete / categorical values (in case you are using Jupyter notebook) is :

df.groupby([col1, col2])\
  .apply(lambda x : x.count())\
  .iloc[:,0]\
  .unstack()\
  .fillna(0)

This matrix will tell you the correspondence between the column values in the two columns.

In case of a one-to-one relationship there will be only one non-zero value per row in the matrix.

like image 163
all_the_questions Avatar answered Sep 18 '22 13:09

all_the_questions


df.groupby('A').B.nunique().max()==1 #Output: True

df.groupby('B').C.nunique().max()==1 #Output: False

Within each value in [groupby column], count the number of unique values in [other column], then check that the maximum for all such counts is one

like image 39
Trevor B Avatar answered Sep 20 '22 13:09

Trevor B


Well, you can create your own function to check it:

def isOneToOne(df, col1, col2):
    first = df.groupby(col1)[col2].count().max()
    second = df.groupby(col2)[col1].count().max()
    return first + second == 2

isOneToOne(df, 'A', 'B')
#True
isOneToOne(df, 'A', 'C')
#False
isOneToOne(df, 'B', 'C')
#False

In case you data is more like this:

df = pd.DataFrame({'A': [0, 1, 2, 0],
                   'C': ["'apple'", "'banana'", "'apple'", "'apple'"],
                   'B': ["'a'", "'b'", "'c'", "'a'"]})
df
#   A    B         C
#0  0  'a'   'apple'
#1  1  'b'  'banana'
#2  2  'c'   'apple'
#3  0  'a'   'apple'

Then you can use:

def isOneToOne(df, col1, col2):
    first = df.drop_duplicates([col1, col2]).groupby(col1)[col2].count().max()
    second = df.drop_duplicates([col1, col2]).groupby(col2)[col1].count().max()
    return first + second == 2
like image 31
zipa Avatar answered Sep 17 '22 13:09

zipa