Here is a dataframe of financial instrument identifiers
import pandas as pd
import numpy as np
df = pd.DataFrame([["ISIN1", "CUSIP1", "SEDOL1"],
["ISIN2", "CUSIP2", "SEDOL2"],
["ISIN3", "CUSIP3", "SEDOL3"],
["ISIN4", "CUSIP4", "SEDOL4"]],
columns=["ISIN", "CUSIP", "SEDOL"])
df
ISIN CUSIP SEDOL
0 ISIN1 CUSIP1 SEDOL1
1 ISIN2 CUSIP2 SEDOL2
2 ISIN3 CUSIP3 SEDOL3
3 ISIN4 CUSIP4 SEDOL4
Imagine that several entries are missing
df.iloc[(1,1)] = np.nan
df.iloc[(1,2)] = np.nan
df.iloc[(2,0)] = np.nan
df.iloc[(3,0)] = np.nan
df.iloc[(3,1)] = np.nan
df
ISIN CUSIP SEDOL
0 ISIN1 CUSIP1 SEDOL3
1 ISIN2 NaN NaN
2 NaN CUSIP3 SEDOL3
3 NaN NaN SEDOL4
In column ID I want to capture a single variable based on this hierarchy : If ISIN is missing, I want to populate CUSIP. If CUSIP is also missing I want to populate SEDOL.
I tried this nested if statement:
def identifier(row):
if ~pd.isnull(row['ISIN']):
return row['ISIN']
elif pd.isnull(row['ISIN']) & ~pd.isnull(row['CUSIP']):
return row['CUSIP']
elif pd.isnull(row['ISIN']) & pd.isnull(row['CUSIP']) & ~pd.isnull(row['SEDOL']):
return row['SEDOL']
df['ID'] = df[['SEDOL', 'CUSIP', 'ISIN']].apply(identifier, axis=1)
Which returned bad output in the ID column for the last 2 entries.
ISIN CUSIP SEDOL ID
0 ISIN1 CUSIP1 SEDOL1 ISIN1
1 ISIN2 NaN NaN ISIN2
2 NaN CUSIP3 SEDOL3 NaN
3 NaN NaN SEDOL4 NaN
My expected output is this:
ISIN CUSIP SEDOL ID
0 ISIN1 CUSIP1 SEDOL1 ISIN1
1 ISIN2 NaN NaN ISIN2
2 NaN CUSIP3 SEDOL3 CUSIP3
3 NaN NaN SEDOL4 SEDOL4
Hope i've explained that well. Note that 'ISIN' is a string. There is no use of the .isin function in my code. Thank you in advance.
In general, you can implement elif logic using np.select, which is detailed here and here.
In this case you can do this succinctly with lookup + notnull().idxmax to find the first non-null value in each row. I added an extra all NaN row to show how that is treated.
df['ID'] = df.lookup(df.index, df.notnull().idxmax(1))
# ISIN CUSIP SEDOL ID
#0 ISIN1 CUSIP1 SEDOL1 ISIN1
#1 ISIN2 NaN NaN ISIN2
#2 NaN CUSIP3 SEDOL3 CUSIP3
#3 NaN NaN SEDOL4 SEDOL4
#4 NaN NaN NaN NaN
To explain the issue with your original problem, it's the use of ~ with pd.isnull.
df['ISIN'].apply(lambda x: ~pd.isnull(x))
0 -1
1 -1
2 -2
3 -2
4 -2
Name: ISIN, dtype: int64
These are not 0, so they all evaluate to True, meaning for every row you take the 'ISIN' column. You should use pd.notnull (not pd.isnull also works) instead of ~pd.isnull
df['ISIN'].apply(lambda x: pd.notnull(x))
0 True
1 True
2 False
3 False
4 False
Name: ISIN, dtype: bool
IIUC using bfill
df['ID']=df.bfill(1).iloc[:,0]
df
Out[346]:
ISIN CUSIP SEDOL ID
0 ISIN1 CUSIP1 SEDOL3 ISIN1
1 ISIN2 NaN NaN ISIN2
2 NaN CUSIP3 SEDOL3 CUSIP3
3 NaN NaN SEDOL4 SEDOL4
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With