I have two dataframes, both of which contain an irregularly spaced, millisecond resolution timestamp column. My goal here is to match up the rows so that for each matched row, 1) the first time stamp is always smaller or equal to the second timestamp, and 2) the matched timestamps are the closest for all pairs of timestamps satisfying 1).
Is there any way to do this with pandas.merge?
merge()
can't do this kind of join, but you can use searchsorted()
:
Create some random timestamps: t1
, t2
, there are in ascending order:
import pandas as pd
import numpy as np
np.random.seed(0)
base = np.array(["2013-01-01 00:00:00"], "datetime64[ns]")
a = (np.random.rand(30)*1000000*1000).astype(np.int64)*1000000
t1 = base + a
t1.sort()
b = (np.random.rand(10)*1000000*1000).astype(np.int64)*1000000
t2 = base + b
t2.sort()
call searchsorted()
to find index in t1
for every value in t2
:
idx = np.searchsorted(t1, t2) - 1
mask = idx >= 0
df = pd.DataFrame({"t1":t1[idx][mask], "t2":t2[mask]})
here is the output:
t1 t2
0 2013-01-02 06:49:13.287000 2013-01-03 16:29:15.612000
1 2013-01-05 16:33:07.211000 2013-01-05 21:42:30.332000
2 2013-01-07 04:47:24.561000 2013-01-07 04:53:53.948000
3 2013-01-07 14:26:03.376000 2013-01-07 17:01:35.722000
4 2013-01-07 14:26:03.376000 2013-01-07 18:22:13.996000
5 2013-01-07 14:26:03.376000 2013-01-07 18:33:55.497000
6 2013-01-08 02:24:54.113000 2013-01-08 12:23:40.299000
7 2013-01-08 21:39:49.366000 2013-01-09 14:03:53.689000
8 2013-01-11 08:06:36.638000 2013-01-11 13:09:08.078000
To view this result by graph:
import pylab as pl
pl.figure(figsize=(18, 4))
pl.vlines(pd.Series(t1), 0, 1, colors="g", lw=1)
pl.vlines(df.t1, 0.3, 0.7, colors="r", lw=2)
pl.vlines(df.t2, 0.3, 0.7, colors="b", lw=2)
pl.margins(0.02)
output:
The green lines are t1
, blue lines are t2
, red lines are selected from t1
for every t2
.
Pandas now has the function merge_asof
, doing exactly what was described in the accepted answer.
I used a different way than HYRY:
All this can be written in few lines:
df=pd.merge(df0, df1, on='Date', how='outer')
df=df.sort(['Date'], ascending=[1])
headertofill=list(df1.columns.values)
df[headertofill]=df[headertofill].fillna(method='pad')
df=df[pd.isnull(df[var_from_df0_only])==False]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With