 

Elegant and Efficient way to map dates from one dataframe to another - Big data

I have two dataframes as given below

import numpy as np
import pandas as pd

t1 = pd.DataFrame({'person_id':[1,2,3],'observation_date':[np.nan,np.nan,np.nan],'observation_datetime':[np.nan,np.nan,np.nan]})

t2 = pd.DataFrame({'person_id':[1,2,3],'value_as_string':['5/28/2007','5/30/2007','6/4/2007']}).set_index('person_id')['value_as_string']

They look as shown below

t1:
   person_id  observation_date  observation_datetime
0          1               NaN                   NaN
1          2               NaN                   NaN
2          3               NaN                   NaN

t2:
person_id
1    5/28/2007
2    5/30/2007
3     6/4/2007
Name: value_as_string, dtype: object

This is what I tried to get the output:

t1['observation_date'] = t1['person_id'].map(t2)
t1['observation_date'] = pd.to_datetime(t1['observation_date'])
t1['observation_datetime'] = pd.to_datetime(t1['observation_date']).dt.strftime('%m/%d/%Y %H:%M:%S')

Though this works fine, it takes a lot of time on the real data.

Please note that I am trying to do this with a t1 dataframe of 1 million records and a t2 dataframe of 15k records, so any efficient approach would be helpful.

I expect my output dataframe to look as shown below

   person_id observation_date observation_datetime
0          1       2007-05-28  05/28/2007 00:00:00
1          2       2007-05-30  05/30/2007 00:00:00
2          3       2007-06-04  06/04/2007 00:00:00

asked Sep 03 '25 by The Great

2 Answers

import pandas as pd

# 15k rows, mirroring the lookup-table size in the question
ids = list(range(1, 15001))
dte = ['5/28/2007', '5/30/2007', '6/4/2007'] * 5000
t1 = pd.DataFrame({'person_id': ids})
t2 = pd.DataFrame({'person_id': ids,
                   'value_as_string': dte})

Merge approach

x = t1.merge(t2, on='person_id', how='left')
# 5.19 ms ± 408 µs per loop

Join approach

x = t1.set_index('person_id').join(t2.set_index('person_id'), how='left')
# 3.02 ms ± 91.4 µs per loop

Map approach with dict

t1['observation_date'] = t1['person_id'].map(
       t2.set_index('person_id')['value_as_string'].to_dict())
# 2.73 ms ± 240 µs per loop

Map approach without dict

t1['observation_date'] = t1['person_id'].map(t2.set_index('person_id')['value_as_string'])
# 2.33 ms ± 260 µs per loop
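
For reference, here is a minimal sketch of how timings like these can be reproduced with the standard timeit module (the figures above look like IPython %timeit output; exact numbers will vary by machine):

import timeit

import pandas as pd

# same benchmark inputs as above
ids = list(range(1, 15001))
dte = ['5/28/2007', '5/30/2007', '6/4/2007'] * 5000
t1 = pd.DataFrame({'person_id': ids})
t2 = pd.DataFrame({'person_id': ids, 'value_as_string': dte})

# build the mapping Series once, so only the map itself is timed
mapping = t2.set_index('person_id')['value_as_string']

n = 100
secs = timeit.timeit(lambda: t1['person_id'].map(mapping), number=n)
print(f'map approach: {secs / n * 1000:.2f} ms per loop')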

So the fastest option here is the plain map:

t1['observation_date'] = pd.to_datetime(
        t1['person_id'].map(t2.set_index('person_id')['value_as_string']))
t1['observation_datetime'] = t1['observation_date'].dt.strftime('%m/%d/%Y %H:%M:%S')
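
One more lever worth mentioning, not covered in the timings above: pd.to_datetime is noticeably faster when given an explicit format, since that skips per-string format inference. A sketch, assuming every value really matches %m/%d/%Y:

t1['observation_date'] = pd.to_datetime(
        t1['person_id'].map(t2.set_index('person_id')['value_as_string']),
        format='%m/%d/%Y')  # explicit format avoids format inference
t1['observation_datetime'] = t1['observation_date'].dt.strftime('%m/%d/%Y %H:%M:%S')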
answered Sep 05 '25 by J. Doe


I have a workaround for your problem. Instead of mapping, why don't you use a faster approach like a merge in pandas? I have used it on close to a million records and it is surprisingly fast.

The process of merging begins with two dataframes. Try doing

df = t1.merge(t2, on='person_id', how='inner')

This does an inner join on person_id between the two dataframes (t1 and t2). The value_as_string column from t2 is carried into the resulting dataframe, and you can then use simple column manipulation to fill the target columns, as sketched below.
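
For concreteness, here is a minimal sketch of that fill step, assuming t2 is kept as a regular dataframe (the question turns it into a Series via set_index(...)['value_as_string']; reset_index undoes that):

t2_df = t2.reset_index()  # columns: person_id, value_as_string

df = t1.merge(t2_df, on='person_id', how='inner')
df['observation_date'] = pd.to_datetime(df['value_as_string'])
df['observation_datetime'] = df['observation_date'].dt.strftime('%m/%d/%Y %H:%M:%S')
df = df.drop(columns='value_as_string')

Note that an inner join drops t1 rows with no match in t2; use how='left' if those rows should be kept with NaN dates.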

Hope that helped.

answered Sep 05 '25 by Heathcliff