This is a self-answered post. A common problem is to randomly generate dates between a given start and end date.
There are two cases to consider:
For example, given some start date 2015-01-01
and an end date 2018-01-01
, how can I sample N random dates between this range using pandas?
randn + to_timedelta. This addresses Case (1). You can do this by generating a random array of timedelta objects and adding them to your start date. This will generate dates with a time component as well.
Is converting to the unix timestamp acceptable?
def random_dates(start, end, n=10): start_u = start.value//10**9 end_u = end.value//10**9 return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')
Sample run:
start = pd.to_datetime('2015-01-01') end = pd.to_datetime('2018-01-01') random_dates(start, end) DatetimeIndex(['2016-10-08 07:34:13', '2015-11-15 06:12:48', '2015-01-24 10:11:04', '2015-03-26 16:23:53', '2017-04-01 00:38:21', '2015-05-15 03:47:54', '2015-06-24 07:32:32', '2015-11-10 20:39:36', '2016-07-25 05:48:09', '2015-03-19 16:05:19'], dtype='datetime64[ns]', freq=None)
EDIT:
As per the comment by @smci, I wrote a function to accommodate both 1 and 2 with a little explanation inside the function itself.
def random_datetimes_or_dates(start, end, out_format='datetime', n=10): ''' unix timestamp is in ns by default. I divide the unix time value by 10**9 to make it seconds (or 24*60*60*10**9 to make it days). The corresponding unit variable is passed to the pd.to_datetime function. Values for the (divide_by, unit) pair to select is defined by the out_format parameter. for 1 -> out_format='datetime' for 2 -> out_format=anything else ''' (divide_by, unit) = (10**9, 's') if out_format=='datetime' else (24*60*60*10**9, 'D') start_u = start.value//divide_by end_u = end.value//divide_by return pd.to_datetime(np.random.randint(start_u, end_u, n), unit=unit)
Sample run:
random_datetimes_or_dates(start, end, out_format='datetime') DatetimeIndex(['2017-01-30 05:14:27', '2016-10-18 21:17:16', '2016-10-20 08:38:02', '2015-09-02 00:03:08', '2015-06-04 02:38:12', '2016-02-19 05:22:01', '2015-11-06 10:37:10', '2017-12-17 03:26:02', '2017-11-20 06:51:32', '2016-01-02 02:48:03'], dtype='datetime64[ns]', freq=None) random_datetimes_or_dates(start, end, out_format='not datetime') DatetimeIndex(['2017-05-10', '2017-12-31', '2017-11-10', '2015-05-02', '2016-04-11', '2015-11-27', '2015-03-29', '2017-05-21', '2015-05-11', '2017-02-08'], dtype='datetime64[ns]', freq=None)
np.random.randn
+ to_timedelta
This addresses Case (1). You can do this by generating a random array of timedelta
objects and adding them to your start
date.
def random_dates(start, end, n, unit='D', seed=None): if not seed: # from piR's answer np.random.seed(0) ndays = (end - start).days + 1 return pd.to_timedelta(np.random.rand(n) * ndays, unit=unit) + start
>>> np.random.seed(0) >>> start = pd.to_datetime('2015-01-01') >>> end = pd.to_datetime('2018-01-01') >>> random_dates(start, end, 10) DatetimeIndex([ '2016-08-25 01:09:42.969600', '2017-02-23 13:30:20.304000', '2016-10-23 05:33:15.033600', '2016-08-20 17:41:04.012799999', '2016-04-09 17:59:00.815999999', '2016-12-09 13:06:00.748800', '2016-04-25 00:47:45.974400', '2017-09-05 06:35:58.444800', '2017-11-23 03:18:47.347200', '2016-02-25 15:14:53.894400'], dtype='datetime64[ns]', freq=None)
This will generate dates with a time component as well.
Sadly, rand
does not support a replace=False
, so if you want unique dates, you'll need a two-step process of 1) generate the non-unique days component, and 2) generate the unique seconds/milliseconds component, then add the two together.
np.random.randint
+ to_timedelta
This addresses Case (2). You can modify random_dates
above to generate random integers instead of random floats:
def random_dates2(start, end, n, unit='D', seed=None): if not seed: # from piR's answer np.random.seed(0) ndays = (end - start).days + 1 return start + pd.to_timedelta( np.random.randint(0, ndays, n), unit=unit )
>>> random_dates2(start, end, 10) DatetimeIndex(['2016-11-15', '2016-07-13', '2017-04-15', '2017-02-02', '2017-10-30', '2015-10-05', '2016-08-22', '2017-12-30', '2016-08-23', '2015-11-11'], dtype='datetime64[ns]', freq=None)
To generate dates with other frequencies, the functions above can be called with a different value for unit
. Additionally, you can add a parameter freq
and tweak your function call as needed.
If you want unique random dates, you can use np.random.choice
with replace=False
:
def random_dates2_unique(start, end, n, unit='D', seed=None): if not seed: # from piR's answer np.random.seed(0) ndays = (end - start).days + 1 return start + pd.to_timedelta( np.random.choice(ndays, n, replace=False), unit=unit )
Going to benchmark just the methods that address Case (1), since Case (2) is really a special case which any method can get to using dt.floor
.
Functions
def cs(start, end, n): ndays = (end - start).days + 1 return pd.to_timedelta(np.random.rand(n) * ndays, unit='D') + start def akilat90(start, end, n): start_u = start.value//10**9 end_u = end.value//10**9 return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s') def piR(start, end, n): dr = pd.date_range(start, end, freq='H') # can't get better than this :-( return pd.to_datetime(np.sort(np.random.choice(dr, n, replace=False))) def piR2(start, end, n): dr = pd.date_range(start, end, freq='H') a = np.arange(len(dr)) b = np.sort(np.random.permutation(a)[:n]) return dr[b]
Benchmarking Code
from timeit import timeit import pandas as pd import matplotlib.pyplot as plt res = pd.DataFrame( index=['cs', 'akilat90', 'piR', 'piR2'], columns=[10, 20, 50, 100, 200, 500, 1000, 2000, 5000], dtype=float ) for f in res.index: for c in res.columns: np.random.seed(0) start = pd.to_datetime('2015-01-01') end = pd.to_datetime('2018-01-01') stmt = '{}(start, end, c)'.format(f) setp = 'from __main__ import start, end, c, {}'.format(f) res.at[f, c] = timeit(stmt, setp, number=30) ax = res.div(res.min()).T.plot(loglog=True) ax.set_xlabel("N"); ax.set_ylabel("time (relative)"); plt.show()
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With