Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

matching between two columns and taking value from another in pandas

Tags:

python

pandas

First of all I am sorry if this question is already answered clearly. I have seen there are very similar answers, but I couldn't use it. So my problem is to match between two sets of columns (UsedFName==FName and UsedLName==LName) and then fill the Usedid Column with the ids from 'id' column when it fully matches.

So here is a toy data set

>> df
          FName     LName    id UsedFName UsedLName  Usedid
0        Tanvir   Hossain  2001    Tanvir   Hossain     NaN
1         Nadia      Alam  2002    Tanvir   Hossain     NaN
2           Pia     Naime  2003    Tanvir   Hossain     NaN
3        Koethe  Talukdar  2004    Koethe  Talukdar     NaN
4        Manual   Hausman  2005    Koethe  Talukdar     NaN
5   Constantine      Pape   NaN       Max     Weber     NaN
6       Andreas       Kai  2006       Max     Weber     NaN
7           Max     Weber  2007    Manual   Hausman     NaN
8         Weber       Mac  2008    Manual   Hausman     NaN
9         Plank      Ingo  2009    Manual   Hausman     NaN
10       Tanvir   Hossain  2001       Pia     Naime     NaN
11        Weber       Mac  2008       Pia     Naime     NaN
12       Manual   Hausman  2005    Tanvir   Hossain     NaN
13          Max     Weber  2007    Tanvir   Hossain     NaN
14        Nadia      Alam  2002    Manual   Hausman     NaN
15        Weber       Mac  2008    Manual   Hausman     NaN
16          Pia     Naime  2003    Koethe  Talukdar     NaN
17          Pia     Naime  2003    Koethe  Talukdar     NaN
18  Constantine      Pape   NaN    Koethe  Talukdar     NaN
19       Koethe  Talukdar  2004    Koethe  Talukdar     NaN
20       Koethe  Talukdar  2005    Manual   Hausman     NaN
21          NaN       NaN   NaN    Manual   Hausman     NaN
22          NaN       NaN   NaN    Manual   Hausman     NaN
23          NaN       NaN   NaN    Manual   Hausman     NaN
24          NaN       NaN   NaN    Manual   Hausman     NaN
25          NaN       NaN   NaN    Manual   Hausman     NaN
26          NaN       NaN   NaN    Manual   Hausman     NaN
27          NaN       NaN   NaN    Manual   Hausman     NaN

This is the output

>>> df
          FName     LName    id UsedFName UsedLName  Usedid
0        Tanvir   Hossain  2001    Tanvir   Hossain    2001
1         Nadia      Alam  2002    Tanvir   Hossain    2001
2           Pia     Naime  2003    Tanvir   Hossain    2001
3        Koethe  Talukdar  2004    Koethe  Talukdar    2005
4        Manual   Hausman  2005    Koethe  Talukdar    2005
5   Constantine      Pape   NaN       Max     Weber    2007
6       Andreas       Kai  2006       Max     Weber    2007
7           Max     Weber  2007    Manual   Hausman    2005
8         Weber       Mac  2008    Manual   Hausman    2005
9         Plank      Ingo  2009    Manual   Hausman    2005
10       Tanvir   Hossain  2001       Pia     Naime    2003
11        Weber       Mac  2008       Pia     Naime    2003
12       Manual   Hausman  2005    Tanvir   Hossain    2001
13          Max     Weber  2007    Tanvir   Hossain    2001
14        Nadia      Alam  2002    Manual   Hausman    2005
15        Weber       Mac  2008    Manual   Hausman    2005
16          Pia     Naime  2003    Koethe  Talukdar    2005
17          Pia     Naime  2003    Koethe  Talukdar    2005
18  Constantine      Pape   NaN    Koethe  Talukdar    2005
19       Koethe  Talukdar  2004    Koethe  Talukdar    2005
20       Koethe  Talukdar  2005    Manual   Hausman    2005
21          NaN       NaN   NaN    Manual   Hausman    2005
22          NaN       NaN   NaN    Manual   Hausman    2005
23          NaN       NaN   NaN    Manual   Hausman    2005
24          NaN       NaN   NaN    Manual   Hausman    2005
25          NaN       NaN   NaN    Manual   Hausman    2005
26          NaN       NaN   NaN    Manual   Hausman    2005
27          NaN       NaN   NaN    Manual   Hausman    2005

Actually I was able to do it using nested for loops, here is the code:

for i in df['UsedFName'].index:
    for j in df['FName'].index:
        if df['UsedFName'][i]==df['FName'][j] & df['UsedLName'][i]==df['LName'][j]:
            df.ix[i,'Usedid'] = df.ix[j,'id']

But using nested for loops here is computationally very expensive. I have a huge data set. Is it possible to use it without nested loops? Is there any simple Pythonic ways or Pandas/Numpy ways that I can use here?

Many thanks in advance for the help...looking forward to learn Python.

like image 903
Shaikh Tanvir Hossain Avatar asked Dec 03 '25 15:12

Shaikh Tanvir Hossain


1 Answers

You will have to think of a more pandaesque may to add the hashing logic, but this matches your expected output and is a lot more efficient, all you want is to use the ids matching UsedFName and "UsedLName" to FName and LNames:

import pandas as pd

# Create dict where each  key is tuple -> (FName,Lname)
# with the corresponding id as the value
d = dict(zip(((f, l) for f, l in zip(df["FName"], df["LName"])), df["id"]))

# Do a lookup in d using a tuple -> (UsedFName, UsedLName) to get the correct id for each pairing
df["Usedid"] = [d[(f, l)] for f,l in zip(df["UsedFName"], df["UsedLName"])]
print(df["Usedid"])

Output:

0     2001
1     2001
2     2001
3     2005
4     2005
5     2007
6     2007
7     2005
8     2005
9     2005
10    2003
11    2003
12    2001
13    2001
14    2005
15    2005
16    2005
17    2005
18    2005
19    2005
20    2005
21    2005
22    2005
23    2005
24    2005
25    2005
26    2005
27    2005
Name: Useid, dtype: float64

If some names may not be used you can use a default value using dict.get.

This is faster than the suggested groupby:

In [12]: %%timeit                                 
df = pd.read_csv("in.csv")
ids = df.groupby(['FName', 'LName']).id.apply(list)
df.Usedid = df.apply(lambda x: int(ids[x.UsedFName, x.UsedLName][-1]), axis=1)
   ....: 
100 loops, best of 3: 5.27 ms per loop

In [13]: %%timeit                                 
df = pd.read_csv("in.csv")
d = dict(zip(((f, l) for f, l in zip(df["FName"], df["LName"])), df["id"]))
df["Usedid"] = [d[(f, l)] for f, l in zip(df["UsedFName"], df["UsedLName"])]
   ....: 

1000 loops, best of 3: 1.41 ms per loop
like image 195
Padraic Cunningham Avatar answered Dec 06 '25 05:12

Padraic Cunningham



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!