How to check whether each value of one column maps to exactly one value in another column?

I have a dataframe like this

import pandas as pd

df = pd.DataFrame({'A':list('bbcddee'), 'B': list('klmnnoi')})

   A  B
0  b  k
1  b  l
2  c  m
3  d  n
4  d  n
5  e  o
6  e  i

and I would like to create a dictionary from the columns A and B using e.g.

dict(zip(df.A, df.B))

Before doing this, I would like to check whether each value in A is mapped to only one value in B; if not, an error should be raised. In the example above this is not the case: b is mapped to k and l, and e is mapped to o and i.
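To illustrate why the check matters, here is a minimal sketch using the same example data: `dict` keeps only the last value seen for a repeated key, so the conflicting pairs disappear without any warning.

```python
import pandas as pd

df = pd.DataFrame({'A': list('bbcddee'), 'B': list('klmnnoi')})

# Later pairs overwrite earlier ones, so b -> k and e -> o are silently lost.
mapping = dict(zip(df.A, df.B))
print(mapping)  # {'b': 'l', 'c': 'm', 'd': 'n', 'e': 'i'}
```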

One way of approaching it would be:

df[df.groupby('A', sort=False)['B'].transform(lambda x: len(set(x))) > 1]

which returns

   A  B
0  b  k
1  b  l
5  e  o
6  e  i

However, that requires a lambda, which might make it slow. Does anyone see a way to speed it up?

Asked by Cleb, Jan 24 '26 21:01

1 Answer

You can use groupby with nunique to count how many unique values in 'B' belong to each unique value in 'A'.

df.groupby('A').B.nunique()
#A
#b    2
#c    1
#d    1
#e    2
#Name: B, dtype: int64

And so you can check if any of them have more than 1 mapping:

df.groupby('A').B.nunique().gt(1).any()
#True
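Since the question asks for an error to be thrown, the check could be wrapped roughly like this. It is only a sketch: `build_mapping` and its `key`/`value` parameters are hypothetical names introduced here, not part of pandas.

```python
import pandas as pd

def build_mapping(df, key="A", value="B"):
    """Return {key: value}; raise if any key maps to more than one value.

    Hypothetical helper for illustration only.
    """
    counts = df.groupby(key)[value].nunique()
    ambiguous = counts[counts.gt(1)]
    if not ambiguous.empty:
        raise ValueError(f"ambiguous keys in {key!r}: {list(ambiguous.index)}")
    return dict(zip(df[key], df[value]))

df = pd.DataFrame({'A': list('bbcddee'), 'B': list('klmnnoi')})

try:
    build_mapping(df)
except ValueError as err:
    print(err)  # reports b and e as the offending keys

# With the ambiguous keys dropped, the mapping is built normally.
clean = df[~df['A'].isin(['b', 'e'])]
print(build_mapping(clean))  # {'c': 'm', 'd': 'n'}
```

One caveat: nunique ignores NaN by default (dropna=True), so a key paired with one real value and a NaN would pass this check. Pass dropna=False to nunique if missing values should count as a distinct mapping.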

The above is conceptually no different from what you proposed. However, there is often a major performance gain from using a built-in groupby operation such as nunique, which runs in optimized code inside pandas, instead of a lambda, which is called once per group in a Python-level loop. In the benchmark below, the lambda becomes nearly 100x slower as the DataFrame gets large, which is a big deal once things start taking seconds to compute.

import perfplot
import pandas as pd
import numpy as np

def gb_lambda(df):
    return df.groupby('A')['B'].apply(lambda x: len(set(x))).gt(1)

def gb_nunique(df):
    return df.groupby('A').B.nunique().gt(1)

perfplot.show(
    setup=lambda n: pd.DataFrame({'A': np.random.randint(0, n//2, n), 
                                  'B': np.random.randint(0, n//2, n)}),
    kernels=[
        lambda df: gb_lambda(df),
        lambda df: gb_nunique(df),
    ],
    labels=['groupby with lambda', 'Groupby.nunique'],
    n_range=[2 ** k for k in range(2,18)],
    equality_check=np.allclose,  
    xlabel='~len(df)'
)

[Figure: perfplot timing comparison of 'groupby with lambda' and 'Groupby.nunique', runtime against ~len(df)]
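If you also want the offending rows themselves, as in the filter from the question, the same built-in can be used through transform by passing the string 'nunique' in place of the lambda. A sketch on the example data (the variable names `n_unique` and `offending` are just for illustration):

```python
import pandas as pd

df = pd.DataFrame({'A': list('bbcddee'), 'B': list('klmnnoi')})

# Broadcast each group's unique count of B back onto the original rows.
n_unique = df.groupby('A')['B'].transform('nunique')

# Keep only rows whose A value maps to more than one B value.
offending = df[n_unique > 1]
print(offending)
```

This selects the same rows (0, 1, 5 and 6) as the lambda-based filter, without calling a Python function per group.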

Answered by ALollz, Jan 26 '26 13:01
