Generating random dates within a given range in pandas

2 Answers

Is converting to the unix timestamp acceptable?

def random_dates(start, end, n=10):      start_u = start.value//10**9     end_u = end.value//10**9      return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')

Sample run:

Click to copy

start = pd.to_datetime('2015-01-01') end = pd.to_datetime('2018-01-01') random_dates(start, end)  DatetimeIndex(['2016-10-08 07:34:13', '2015-11-15 06:12:48',                '2015-01-24 10:11:04', '2015-03-26 16:23:53',                '2017-04-01 00:38:21', '2015-05-15 03:47:54',                '2015-06-24 07:32:32', '2015-11-10 20:39:36',                '2016-07-25 05:48:09', '2015-03-19 16:05:19'],               dtype='datetime64[ns]', freq=None)

EDIT:

As per the comment by @smci, I wrote a function to accommodate both 1 and 2 with a little explanation inside the function itself.

Click to copy

def random_datetimes_or_dates(start, end, out_format='datetime', n=10):       '''        unix timestamp is in ns by default.      I divide the unix time value by 10**9 to make it seconds (or 24*60*60*10**9 to make it days).     The corresponding unit variable is passed to the pd.to_datetime function.      Values for the (divide_by, unit) pair to select is defined by the out_format parameter.     for 1 -> out_format='datetime'     for 2 -> out_format=anything else     '''     (divide_by, unit) = (10**9, 's') if out_format=='datetime' else (24*60*60*10**9, 'D')      start_u = start.value//divide_by     end_u = end.value//divide_by      return pd.to_datetime(np.random.randint(start_u, end_u, n), unit=unit)

Sample run:

Click to copy

random_datetimes_or_dates(start, end, out_format='datetime')  DatetimeIndex(['2017-01-30 05:14:27', '2016-10-18 21:17:16',                '2016-10-20 08:38:02', '2015-09-02 00:03:08',                '2015-06-04 02:38:12', '2016-02-19 05:22:01',                     '2015-11-06 10:37:10', '2017-12-17 03:26:02',                    '2017-11-20 06:51:32', '2016-01-02 02:48:03'],                   dtype='datetime64[ns]', freq=None)  random_datetimes_or_dates(start, end, out_format='not datetime')  DatetimeIndex(['2017-05-10', '2017-12-31', '2017-11-10', '2015-05-02',                '2016-04-11', '2015-11-27', '2015-03-29', '2017-05-21',                '2015-05-11', '2017-02-08'],               dtype='datetime64[ns]', freq=None)

answered Oct 16 '22 13:10

akilat90

`np.random.randn` + `to_timedelta`

This addresses Case (1). You can do this by generating a random array of timedelta objects and adding them to your start date.

Click to copy

def random_dates(start, end, n, unit='D', seed=None):     if not seed:  # from piR's answer         np.random.seed(0)      ndays = (end - start).days + 1     return pd.to_timedelta(np.random.rand(n) * ndays, unit=unit) + start

Click to copy

>>> np.random.seed(0) >>> start = pd.to_datetime('2015-01-01') >>> end = pd.to_datetime('2018-01-01') >>> random_dates(start, end, 10) DatetimeIndex([   '2016-08-25 01:09:42.969600',                   '2017-02-23 13:30:20.304000',                   '2016-10-23 05:33:15.033600',                '2016-08-20 17:41:04.012799999',                '2016-04-09 17:59:00.815999999',                   '2016-12-09 13:06:00.748800',                   '2016-04-25 00:47:45.974400',                   '2017-09-05 06:35:58.444800',                   '2017-11-23 03:18:47.347200',                   '2016-02-25 15:14:53.894400'],               dtype='datetime64[ns]', freq=None)

This will generate dates with a time component as well.

Sadly, rand does not support a replace=False, so if you want unique dates, you'll need a two-step process of 1) generate the non-unique days component, and 2) generate the unique seconds/milliseconds component, then add the two together.

`np.random.randint` + `to_timedelta`

This addresses Case (2). You can modify random_dates above to generate random integers instead of random floats:

Click to copy

def random_dates2(start, end, n, unit='D', seed=None):     if not seed:  # from piR's answer         np.random.seed(0)      ndays = (end - start).days + 1     return start + pd.to_timedelta(         np.random.randint(0, ndays, n), unit=unit     )

Click to copy

>>> random_dates2(start, end, 10) DatetimeIndex(['2016-11-15', '2016-07-13', '2017-04-15', '2017-02-02',                '2017-10-30', '2015-10-05', '2016-08-22', '2017-12-30',                '2016-08-23', '2015-11-11'],               dtype='datetime64[ns]', freq=None)

To generate dates with other frequencies, the functions above can be called with a different value for unit. Additionally, you can add a parameter freq and tweak your function call as needed.

If you want unique random dates, you can use np.random.choice with replace=False:

Click to copy

def random_dates2_unique(start, end, n, unit='D', seed=None):     if not seed:  # from piR's answer         np.random.seed(0)      ndays = (end - start).days + 1     return start + pd.to_timedelta(         np.random.choice(ndays, n, replace=False), unit=unit     )

Performance

Going to benchmark just the methods that address Case (1), since Case (2) is really a special case which any method can get to using dt.floor.

enter image description here Functions

Click to copy

def cs(start, end, n):     ndays = (end - start).days + 1     return pd.to_timedelta(np.random.rand(n) * ndays, unit='D') + start  def akilat90(start, end, n):     start_u = start.value//10**9     end_u = end.value//10**9      return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')  def piR(start, end, n):     dr = pd.date_range(start, end, freq='H') # can't get better than this :-(     return pd.to_datetime(np.sort(np.random.choice(dr, n, replace=False)))  def piR2(start, end, n):     dr = pd.date_range(start, end, freq='H')     a = np.arange(len(dr))     b = np.sort(np.random.permutation(a)[:n])     return dr[b]

Benchmarking Code

Click to copy

from timeit import timeit  import pandas as pd import matplotlib.pyplot as plt  res = pd.DataFrame(        index=['cs', 'akilat90', 'piR', 'piR2'],        columns=[10, 20, 50, 100, 200, 500, 1000, 2000, 5000],        dtype=float )  for f in res.index:      for c in res.columns:         np.random.seed(0)          start = pd.to_datetime('2015-01-01')         end = pd.to_datetime('2018-01-01')          stmt = '{}(start, end, c)'.format(f)         setp = 'from __main__ import start, end, c, {}'.format(f)         res.at[f, c] = timeit(stmt, setp, number=30)  ax = res.div(res.min()).T.plot(loglog=True)  ax.set_xlabel("N");  ax.set_ylabel("time (relative)");  plt.show()

answered Oct 16 '22 12:10

cs95

Related questions
                            
                                Is there an R equivalent of the pythonic "if __name__ == "__main__": main()"?
                            
                                Python: How to show matplotlib in flask [duplicate]
                            
                                Using Numpy Vectorize on Functions that Return Vectors
                            
                                Why is variable1 += variable2 much faster than variable1 = variable1 + variable2?
                            
                                How to rearrange array based upon index array
                            
                                Using Merge on a column and Index in Pandas
                            
                                Returning multiple values from pandas apply on a DataFrame
                            
                                Why is startswith slower than slicing
                            
                                Apply function to pandas groupby
                            
                                Relative import in Python 3 is not working [duplicate]
                            
                                How to handle a broken pipe (SIGPIPE) in python?
                            
                                How to unquote a urlencoded unicode string in python?
                            
                                Permission problems when creating a dir with os.makedirs in Python
                            
                                Python pandas: Add a column to my dataframe that counts a variable
                            
                                constants in Python: at the root of the module or in a namespace inside the module?
                            
                                Automatic version number both in setup.py (setuptools) AND source code?
                            
                                Replace value for a selected cell in pandas DataFrame without using index
                            
                                .pyw files in python program
                            
                                python, unittest: is there a way to pass command line options to the app
                            
                                Finding count of distinct elements in DataFrame in each column

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Generating random dates within a given range in pandas

Tags:

python

datetime

random

pandas

cs95

People also ask

2 Answers

akilat90

`np.random.randn` + `to_timedelta`

`np.random.randint` + `to_timedelta`

Performance

cs95

Recent Activity

Donate For Us

Generating random dates within a given range in pandas

Tags:

python

datetime

random

pandas

cs95

People also ask

2 Answers

akilat90

np.random.randn + to_timedelta

np.random.randint + to_timedelta

Performance

cs95

Related questions

Recent Activity

Donate For Us

`np.random.randn` + `to_timedelta`

`np.random.randint` + `to_timedelta`