 

Elegant and Efficient way to map dates from one dataframe to another - Big data

I have two dataframes as given below

import numpy as np
import pandas as pd

t1 = pd.DataFrame({'person_id':[1,2,3],'observation_date':[np.nan,np.nan,np.nan],'observation_datetime':[np.nan,np.nan,np.nan]})

t2 = pd.DataFrame({'person_id':[1,2,3],'value_as_string':['5/28/2007','5/30/2007','6/4/2007']}).set_index('person_id')['value_as_string']

They look as shown below

t1:
   person_id  observation_date  observation_datetime
0          1               NaN                   NaN
1          2               NaN                   NaN
2          3               NaN                   NaN

t2:
person_id
1    5/28/2007
2    5/30/2007
3     6/4/2007
Name: value_as_string, dtype: object

This is what I tried to get the output:

t1['observation_date'] = t1['person_id'].map(t2)
t1['observation_date'] = pd.to_datetime(t1['observation_date'])
t1['observation_datetime'] = pd.to_datetime(t1['observation_date']).dt.strftime('%m/%d/%Y %H:%M:%S')

Though this works fine, it takes a lot of time on the real data.

Please note that I am trying to do this with a t1 dataframe of 1 million records and a t2 dataframe of 15k records, so any efficient approach would be helpful.

I expect my output dataframe to look as shown below

   person_id observation_date observation_datetime
0          1       2007-05-28  05/28/2007 00:00:00
1          2       2007-05-30  05/30/2007 00:00:00
2          3       2007-06-04  06/04/2007 00:00:00

asked Sep 03 '25 by The Great

2 Answers

import pandas as pd

# 15k rows, mirroring the lookup-table size in the question
ids = list(range(1, 15001))
dte = ['5/28/2007', '5/30/2007', '6/4/2007'] * 5000
t1 = pd.DataFrame({'person_id': ids})
t2 = pd.DataFrame({'person_id': ids,
                   'value_as_string': dte})

Merge approach

x = t1.merge(t2, on='person_id', how='left')
# 5.19 ms ± 408 µs per loop

Join approach

x = t1.set_index('person_id').join(t2.set_index('person_id'), how='left')
# 3.02 ms ± 91.4 µs per loop

Map approach with dict

t1['observation_date'] = t1['person_id'].map(
       t2.set_index('person_id')['value_as_string'].to_dict())
# 2.73 ms ± 240 µs per loop

Map approach without dict

t1['observation_date'] = t1['person_id'].map(t2.set_index('person_id')['value_as_string'])
# 2.33 ms ± 260 µs per loop
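
For reference, here is a minimal sketch of how timings like these can be reproduced with the standard timeit module (the figures above look like IPython %timeit output; exact numbers will vary by machine):

import timeit

import pandas as pd

# same benchmark inputs as above
ids = list(range(1, 15001))
dte = ['5/28/2007', '5/30/2007', '6/4/2007'] * 5000
t1 = pd.DataFrame({'person_id': ids})
t2 = pd.DataFrame({'person_id': ids, 'value_as_string': dte})

# build the mapping Series once, so only the map itself is timed
mapping = t2.set_index('person_id')['value_as_string']

n = 100
secs = timeit.timeit(lambda: t1['person_id'].map(mapping), number=n)
print(f'map approach: {secs / n * 1000:.2f} ms per loop')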

So the fastest option here is the plain map:

t1['observation_date'] = pd.to_datetime(
        t1['person_id'].map(t2.set_index('person_id')['value_as_string']))
t1['observation_datetime'] = t1['observation_date'].dt.strftime('%m/%d/%Y %H:%M:%S')
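
One more lever worth mentioning, not covered in the timings above: pd.to_datetime is noticeably faster when given an explicit format, since that skips per-string format inference. A sketch, assuming every value really matches %m/%d/%Y:

t1['observation_date'] = pd.to_datetime(
        t1['person_id'].map(t2.set_index('person_id')['value_as_string']),
        format='%m/%d/%Y')  # explicit format avoids format inference
t1['observation_datetime'] = t1['observation_date'].dt.strftime('%m/%d/%Y %H:%M:%S')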
answered Sep 05 '25 by J. Doe


I have a workaround for your problem. Instead of mapping, why don't you use a faster approach like a merge in pandas? I have used it on close to a million records and it is surprisingly fast.

The process of merging begins with two dataframes. Try doing

df = t1.merge(t2, on='person_id', how='inner')

This does an inner join on person_id between the two dataframes (t1 and t2). The value_as_string column from t2 is carried into the resulting dataframe, and you can then use simple column manipulation to fill the target columns, as sketched below.
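
For concreteness, here is a minimal sketch of that fill step, assuming t2 is kept as a regular dataframe (the question turns it into a Series via set_index(...)['value_as_string']; reset_index undoes that):

t2_df = t2.reset_index()  # columns: person_id, value_as_string

df = t1.merge(t2_df, on='person_id', how='inner')
df['observation_date'] = pd.to_datetime(df['value_as_string'])
df['observation_datetime'] = df['observation_date'].dt.strftime('%m/%d/%Y %H:%M:%S')
df = df.drop(columns='value_as_string')

Note that an inner join drops t1 rows with no match in t2; use how='left' if those rows should be kept with NaN dates.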

Hope that helped.

answered Sep 05 '25 by Heathcliff