I have two dataframes, mapp and data, as shown below:
mapp = pd.DataFrame({'variable': ['d22','Studyid','noofsons','Level','d21'],'concept_id':[1,2,3,4,5]})
data = pd.DataFrame({'sourcevalue': ['d22heartabcd','Studyid','noofsons','Level','d21abcdef']})


I would like to fetch each value from data and check whether it is present in mapp; if it is, get the corresponding concept_id value. The priority is to look for an exact match first. If no exact match is found, fall back to a substring match. As I am dealing with more than a million records, any scalable solution is helpful.
s = mapp.set_index('variable')['concept_id']
data['concept_id'] = data['sourcevalue'].map(s)
produces an output like below

When I do a substring match, valid records also become NA, as shown below:
data['concept_id'] = data['sourcevalue'].str[:3].map(s)

I don't know why it's giving NA for valid records now.
How can I do these two checks at once in an elegant and efficient manner?
I expect my output to be as shown below:

If you need to map by the full strings and by the first 3 letters, create 2 separate Series and then use Series.fillna or Series.combine_first to replace the missing values in a with the values from b:
s = mapp.set_index('variable')['concept_id']
a = data['sourcevalue'].map(s)
b = data['sourcevalue'].str[:3].map(s)
data['concept_id'] = a.fillna(b)
#alternative
#data['concept_id'] = a.combine_first(b)
print (data)
    sourcevalue  concept_id
0  d22heartabcd         1.0
1       Studyid         2.0
2      noofsons         3.0
3         Level         4.0
4     d21abcdef         5.0
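The fixed three-character slice works here because the keys that need prefix matching (d22, d21) are exactly three characters long. If the keys in mapp vary in length, one possible variation is to extract whichever key the value starts with and map that instead. This is a minimal sketch under that assumption, not part of the original answer; the names keys, pattern and prefix_key are introduced only for illustration, and a regex alternation over a very large mapp may be slow.

```python
import re

import pandas as pd

mapp = pd.DataFrame({'variable': ['d22', 'Studyid', 'noofsons', 'Level', 'd21'],
                     'concept_id': [1, 2, 3, 4, 5]})
data = pd.DataFrame({'sourcevalue': ['d22heartabcd', 'Studyid', 'noofsons',
                                     'Level', 'd21abcdef']})

s = mapp.set_index('variable')['concept_id']

# Longest keys first, so the regex prefers the longest matching prefix
# (assumption: that is the desired tie-break when keys overlap).
keys = sorted(s.index, key=len, reverse=True)
pattern = '^(' + '|'.join(re.escape(k) for k in keys) + ')'

# Exact match has priority; rows without one fall back to the prefix match.
exact = data['sourcevalue'].map(s)
prefix_key = data['sourcevalue'].str.extract(pattern, expand=False)
data['concept_id'] = exact.fillna(prefix_key.map(s))
print(data)
```

The pattern is anchored with ^, so only values that start with a key are matched; dropping the anchor would turn it into a match anywhere in the string.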
EDIT:
#Series for mapping by the full strings
s = mapp.set_index('variable')['concept_id']
print (s)
variable
d22 1
Studyid 2
noofsons 3
Level 4
d21 5
Name: concept_id, dtype: int64
#Series for mapping by the first 3 letters
s1 = mapp.assign(variable = mapp['variable'].str[:3]).set_index('variable')['concept_id']
print (s1)
variable
d22 1
Stu 2
noo 3
Lev 4
d21 5
Name: concept_id, dtype: int64
#first 3 letters looked up in the full-string Series: 'Stu', 'noo', 'Lev' are not keys, so they become NaN
print (data['sourcevalue'].str[:3].map(s))
0 1.0
1 NaN
2 NaN
3 NaN
4 5.0
Name: sourcevalue, dtype: float64
#first 3 letters looked up in the first-3-letters Series: every row matches
print (data['sourcevalue'].str[:3].map(s1))
0 1
1 2
2 3
3 4
4 5
Name: sourcevalue, dtype: int64
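Putting the two lookups from the EDIT together, a self-contained sketch might look like the following. The extra 'unknown' row, the drop_duplicates call (which keeps the first concept_id if two variables share the same first 3 letters, so that the index stays unique) and the cast to the nullable Int64 dtype are assumptions added for illustration, not part of the original answer.

```python
import pandas as pd

mapp = pd.DataFrame({'variable': ['d22', 'Studyid', 'noofsons', 'Level', 'd21'],
                     'concept_id': [1, 2, 3, 4, 5]})
# 'unknown' is a hypothetical extra row that matches nothing.
data = pd.DataFrame({'sourcevalue': ['d22heartabcd', 'Studyid', 'noofsons',
                                     'Level', 'd21abcdef', 'unknown']})

# Full-string lookup.
s = mapp.set_index('variable')['concept_id']

# First-3-letters lookup; duplicates are dropped so the index is unique.
s1 = (mapp.assign(variable=mapp['variable'].str[:3])
          .drop_duplicates('variable')
          .set_index('variable')['concept_id'])

exact = data['sourcevalue'].map(s)            # priority 1: exact match
prefix = data['sourcevalue'].str[:3].map(s1)  # priority 2: first 3 letters

# Fill the gaps in the exact match; Int64 keeps integers alongside missing values.
data['concept_id'] = exact.fillna(prefix).astype('Int64')
print(data)
```

Note that s1 also matches unrelated values that merely share the first 3 letters with a key (for example, a value starting with 'Stu' that is not 'Studyid'), so whether this fallback is appropriate depends on the data.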