I have two dataframes as given below
import numpy as np
import pandas as pd
t1 = pd.DataFrame({'person_id': [1, 2, 3], 'observation_date': [np.nan, np.nan, np.nan], 'observation_datetime': [np.nan, np.nan, np.nan]})
t2 = pd.DataFrame({'person_id': [1, 2, 3], 'value_as_string': ['5/28/2007', '5/30/2007', '6/4/2007']}).set_index('person_id')['value_as_string']
They look as shown below.
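For reference, printing t1 and t2 gives output along these lines (exact alignment may vary):
   person_id  observation_date  observation_datetime
0          1               NaN                   NaN
1          2               NaN                   NaN
2          3               NaN                   NaN

person_id
1    5/28/2007
2    5/30/2007
3     6/4/2007
Name: value_as_string, dtype: object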
This is what I tried to get the output
# map person_id -> date string, parse to datetime, then format the datetime column
t1['observation_date'] = t1['person_id'].map(t2)
t1['observation_date'] = pd.to_datetime(t1['observation_date'])
t1['observation_datetime'] = t1['observation_date'].dt.strftime('%m/%d/%Y %H:%M:%S')
Though this works fine, it takes a lot of time on the real data.
Please note that I am trying to do this on a t1 dataframe of about 1 million records and a t2 dataframe of about 15k records, so any efficient approach would be helpful.
I expect my output dataframe to look as shown below.
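Based on the code above, the expected result is along these lines (observation_date as a datetime, observation_datetime as a formatted string):
   person_id observation_date observation_datetime
0          1       2007-05-28  05/28/2007 00:00:00
1          2       2007-05-30  05/30/2007 00:00:00
2          3       2007-06-04  06/04/2007 00:00:00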
# benchmark setup: ~15k unique ids
ids = list(range(1, 15001))
dte = ['5/28/2007', '5/30/2007', '6/4/2007'] * 5000
t1 = pd.DataFrame({'person_id': ids})
t2 = pd.DataFrame({'person_id': ids,
                   'value_as_string': dte})
Merge approach
x = t1.merge(t2, on='person_id', how='left')
# 5.19 ms ± 408 µs per loop
Join approach
x = t1.set_index('person_id').join(t2.set_index('person_id'), how='left')
# 3.02 ms ± 91.4 µs per loop
Map approach with dict
t1['observation_date'] = t1['person_id'].map(
t2.set_index('person_id')['value_as_string'].to_dict())
# 2.73 ms ± 240 µs per loop
Map approach without dict
t1['observation_date'] = t1['person_id'].map(t2.set_index('person_id')['value_as_string'])
# 2.33 ms ± 260 µs per loop
So the fastest option is the map approach without a dict; applied to your columns:
t1['observation_date'] = pd.to_datetime(
t1['person_id'].map(t2.set_index('person_id')['value_as_string']))
t1['observation_datetime'] = t1['observation_date'].dt.strftime('%m/%d/%Y %H:%M:%S')
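If t2 is already a Series indexed by person_id (as in your original example), the set_index call can be dropped. A minimal sketch under that assumption; passing an explicit format to pd.to_datetime also tends to speed up parsing on large data:
# t2: Series indexed by person_id, values are the date strings
t1['observation_date'] = pd.to_datetime(t1['person_id'].map(t2), format='%m/%d/%Y')
t1['observation_datetime'] = t1['observation_date'].dt.strftime('%m/%d/%Y %H:%M:%S')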
I have a workaround for your problem. Instead of mapping, why not use a merge in pandas? I have used it on close to a million records and it is surprisingly fast.
The merge starts from the two dataframes. Try doing
df = t1.merge(t2, on='person_id', how='inner')
This does an inner join on person_id across both dataframes (t1 and t2), which introduces the value_as_string column from t2 into the resulting dataframe. You can then use simple column manipulation to fill the values into the target columns, as sketched below.
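A minimal sketch of that follow-up step, assuming t2 is a dataframe with person_id and value_as_string columns:
# bring value_as_string into t1's rows via an inner join
df = t1.merge(t2, on='person_id', how='inner')
# fill the target columns from the merged strings, then drop the helper column
df['observation_date'] = pd.to_datetime(df['value_as_string'])
df['observation_datetime'] = df['observation_date'].dt.strftime('%m/%d/%Y %H:%M:%S')
df = df.drop(columns='value_as_string')
Note that an inner join drops t1 rows that have no match in t2; how='left' keeps them (with NaT in the date columns).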
Hope that helped.