
Elegant way to do fuzzy map based on a mix of substring and string in pandas

I have two dataframes, mapp and data, as shown below:

mapp = pd.DataFrame({'variable': ['d22','Studyid','noofsons','Level','d21'],'concept_id':[1,2,3,4,5]})

data = pd.DataFrame({'sourcevalue': ['d22heartabcd','Studyid','noofsons','Level','d21abcdef']})


I would like to fetch each value from data and check whether it is present in mapp; if it is, get the corresponding concept_id value. The priority is to look for an exact match first and to fall back to a substring match only if no exact match is found. As I am dealing with more than a million records, any scalable solution would be helpful.

s = mapp.set_index('variable')['concept_id']
data['concept_id'] = data['sourcevalue'].map(s) 

This fills concept_id only for the exact matches (Studyid, noofsons, Level); d22heartabcd and d21abcdef get NaN.

When I match on the first three characters instead, records that matched exactly before also become NA:

data['concept_id'] = data['sourcevalue'].str[:3].map(s)


I don't know why it is now giving NA for valid records.

How can I do these two checks at once in an elegant and efficient manner?

I expect every row in data to end up with its concept_id, i.e. 1, 2, 3, 4, 5 for the sample above.

The Great asked Dec 22 '25 04:12


1 Answer

If you need to map by the full strings and by the first 3 letters, create two separate Series and then use Series.fillna or Series.combine_first to replace the missing values in a with the values from b:

s = mapp.set_index('variable')['concept_id']
a = data['sourcevalue'].map(s) 
b = data['sourcevalue'].str[:3].map(s)

data['concept_id'] = a.fillna(b)
#alternative
#data['concept_id'] = a.combine_first(b)
print (data)
    sourcevalue  concept_id
0  d22heartabcd         1.0
1       Studyid         2.0
2      noofsons         3.0
3         Level         4.0
4     d21abcdef         5.0
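The fillna approach above relies on every fallback key being exactly 3 characters long. If the keys in mapp can have different lengths, one possible variation (a sketch, not part of the original answer) is to build a regex from the keys and extract the matching prefix with Series.str.extract. The names exact, prefix, keys and pattern are illustrative, and the sketch assumes the substring always sits at the start of sourcevalue:

```python
import re
import pandas as pd

mapp = pd.DataFrame({'variable': ['d22', 'Studyid', 'noofsons', 'Level', 'd21'],
                     'concept_id': [1, 2, 3, 4, 5]})
data = pd.DataFrame({'sourcevalue': ['d22heartabcd', 'Studyid', 'noofsons',
                                     'Level', 'd21abcdef']})

s = mapp.set_index('variable')['concept_id']

# Step 1: exact match on the full string
exact = data['sourcevalue'].map(s)

# Step 2: prefix match; longest keys first so a longer key wins the alternation
keys = sorted(s.index, key=len, reverse=True)
pattern = '^(' + '|'.join(re.escape(k) for k in keys) + ')'
prefix = data['sourcevalue'].str.extract(pattern, expand=False).map(s)

# Exact matches take priority; the prefix match only fills the gaps.
# Int64 is the nullable integer dtype, so unmatched rows would stay <NA>.
data['concept_id'] = exact.fillna(prefix).astype('Int64')
print(data)
```

Both steps are vectorized, but a regex alternation over a very large number of keys may become slow, so it is worth timing on a realistic sample before using it on millions of rows.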

EDIT:

#Series for mapping by the full strings
s = mapp.set_index('variable')['concept_id']
print (s)
variable
d22         1
Studyid     2
noofsons    3
Level       4
d21         5
Name: concept_id, dtype: int64

#Series for mapping by the first 3 letters of each variable
s1 = mapp.assign(variable = mapp['variable'].str[:3]).set_index('variable')['concept_id']
print (s1)
variable
d22    1
Stu    2
noo    3
Lev    4
d21    5
Name: concept_id, dtype: int64

#first 3 letters mapped by the full-string Series: only the 3-character keys (d22, d21) match
print (data['sourcevalue'].str[:3].map(s))
0    1.0
1    NaN
2    NaN
3    NaN
4    5.0
Name: sourcevalue, dtype: float64

#first 3 letters mapped by the first-3-letters Series: every row matches
print (data['sourcevalue'].str[:3].map(s1))
0    1
1    2
2    3
3    4
4    5
Name: sourcevalue, dtype: int64
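Putting the two Series together, a minimal self-contained sketch could look like the following. The value 'Studyid_v2' is a hypothetical addition to the sample data, included to show a case where the prefix fallback needs s1 rather than s; the drop_duplicates call is also an assumption, added because Series.map expects the mapping Series to have a unique index and two variables could share the same first 3 letters:

```python
import pandas as pd

mapp = pd.DataFrame({'variable': ['d22', 'Studyid', 'noofsons', 'Level', 'd21'],
                     'concept_id': [1, 2, 3, 4, 5]})
# 'Studyid_v2' is hypothetical: a long key followed by extra characters
data = pd.DataFrame({'sourcevalue': ['d22heartabcd', 'Studyid', 'Studyid_v2',
                                     'Level', 'd21abcdef']})

# Full-string lookup
s = mapp.set_index('variable')['concept_id']

# First-3-letters lookup; keep the first concept_id per prefix so the index is unique
prefixes = mapp.assign(variable=mapp['variable'].str[:3])
s1 = prefixes.drop_duplicates('variable').set_index('variable')['concept_id']

exact = data['sourcevalue'].map(s)              # priority 1: exact match
fallback = data['sourcevalue'].str[:3].map(s1)  # priority 2: first 3 letters

# Every row matches here, so a plain int cast is safe
data['concept_id'] = exact.fillna(fallback).astype(int)
print(data)
```

With s instead of s1 in the fallback, 'Studyid_v2' would stay NaN, because 'Stu' is not a key of the full-string Series.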
jezrael answered Dec 23 '25 17:12