Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Creating fake data using date logic

I am trying to insert fake data into this table. It can't be totally random because the rows would need to make sense. I'll explain below.

My data looks like this:

AcctID account_status start_date end_date
C382861922 ACTIVE 2016-05-25 None
C382861922 INACTIVE None None
C382861922 ACTIVE None None
C382861922 INACTIVE None 2021-12-31
C429768513 ACTIVE 2015-12-27 None
C429768513 INACTIVE None None
C429768513 ACTIVE None None
C429768513 INACTIVE None None
C429768513 ACTIVE None None
C429768513 INACTIVE None None
C429768513 ACTIVE None None
C429768513 INACTIVE None 2021-12-31
C643625629 ACTIVE 2016-07-24 None
C643625629 INACTIVE None None
C643625629 ACTIVE None 2021-12-31
C82157435 ACTIVE 2016-10-22 None
C82157435 INACTIVE None 2021-12-31

Each AcctID can appear multiple times, but it's easiest to explain what I'm doing with just an example where the AcctID appears twice:

AcctID account_status start_date end_date
C82157435 ACTIVE 2016-10-22 None
C82157435 INACTIVE None 2021-12-31

My goal is to randomly pick a date where this customer changed their account_status, which would become both the end_date of the first row and the start_date of the 2nd row. So, I only need to pick 1 random date, and insert it in both places. Easy enough - I can max() and min() and then calculate the difference in days, and then choose a random integer within that range.

However, I can't figure out how I'd do it for a customer with using more than 2 records:

AcctID account_status start_date end_date
C429768513 ACTIVE 2015-12-27 None
C429768513 INACTIVE None None
C429768513 ACTIVE None None
C429768513 INACTIVE None None
C429768513 ACTIVE None None
C429768513 INACTIVE None None
C429768513 ACTIVE None None
C429768513 INACTIVE None 2021-12-31

There will be several places to choose a random date, but since they need to correspond to each other, the problem becomes really complex. Any ideas?

Here's code to create the sample dataframe:

import pandas as pd

fake = [
 {
   "AcctID": "C429768513",
   "account_status": "ACTIVE",
   "start_date": "2015-12-27",
   "end_date": "None"
 },
 {
   "AcctID": "C429768513",
   "account_status": "INACTIVE",
   "start_date": "None",
   "end_date": "None"
 },
 {
   "AcctID": "C429768513",
   "account_status": "ACTIVE",
   "start_date": "None",
   "end_date": "None"
 },
 {
   "AcctID": "C429768513",
   "account_status": "INACTIVE",
   "start_date": "None",
   "end_date": "None"
 },
 {
   "AcctID": "C429768513",
   "account_status": "ACTIVE",
   "start_date": "None",
   "end_date": "None"
 },
 {
   "AcctID": "C429768513",
   "account_status": "INACTIVE",
   "start_date": "None",
   "end_date": "None"
 },
 {
   "AcctID": "C429768513",
   "account_status": "ACTIVE",
   "start_date": "None",
   "end_date": "None"
 },
 {
   "AcctID": "C429768513",
   "account_status": "INACTIVE",
   "start_date": "None",
   "end_date": "2021-12-31"
 }
]

df = pd.DataFrame(fake)

Edit: Here's a fake example of what the program output could look like. Please note most of the dates are randomly chosen - but the end date of the preceding row matches the start date of the next row.

AcctID account_status start_date end_date
C429768513 ACTIVE 2015-12-27 2016-01-05
C429768513 INACTIVE 2016-01-05 2016-03-01
C429768513 ACTIVE 2016-03-01 2017-06-22
C429768513 INACTIVE 2017-06-22 2017-09-04
C429768513 ACTIVE 2017-09-04 2018-10-27
C429768513 INACTIVE 2018-10-27 2019-04-04
C429768513 ACTIVE 2019-04-04 2020-06-06
C429768513 INACTIVE 2020-06-06 2021-12-31
like image 404
Josh Avatar asked Oct 14 '22 21:10

Josh


1 Answers

One way to solve this problem :

df = df.replace(to_replace='None', value=np.nan)

def random_date(x):
    s_d = pd.to_datetime(x[x['start_date'].notna()]['start_date'])
    e_d = pd.to_datetime(x[x['end_date'].notna()]['end_date'])
    
    start_u = s_d.iloc[0].value//10**9
    end_u = e_d.iloc[0].value//10**9
    end_date_list = sorted(pd.to_datetime(np.random.randint(start_u, end_u, len(x)-1), unit='s').values)
    end_date_list = np.append(end_date_list, e_d.values)
    x['end_date'] = end_date_list
    mask = x['start_date'].isna()
    x.loc[mask,'start_date'] = x.shift(1).loc[mask]['end_date'].astype(str)
    x['start_date'] = pd.to_datetime(x['start_date']).dt.date
    x['end_date'] = pd.to_datetime(x['end_date']).dt.date
    return x

df = df.groupby('AcctID').apply(random_date)

output:

        AcctID account_status  start_date    end_date
0   C382861922         ACTIVE  2016-05-25  2016-12-23
1   C382861922       INACTIVE  2016-12-23  2017-12-28
2   C382861922         ACTIVE  2017-12-28  2019-04-24
3   C382861922       INACTIVE  2019-04-24  2021-12-31
4   C429768513         ACTIVE  2015-12-27  2017-12-04
5   C429768513       INACTIVE  2017-12-04  2019-01-07
6   C429768513         ACTIVE  2019-01-07  2019-04-03
7   C429768513       INACTIVE  2019-04-03  2020-06-13
8   C429768513         ACTIVE  2020-06-13  2021-02-13
9   C429768513       INACTIVE  2021-02-13  2021-03-09
10  C429768513         ACTIVE  2021-03-09  2021-08-09
11  C429768513       INACTIVE  2021-08-09  2021-12-31
12  C643625629         ACTIVE  2016-07-24  2021-02-27
13  C643625629       INACTIVE  2021-02-27  2021-05-20
14  C643625629         ACTIVE  2021-05-20  2021-12-31
15   C82157435         ACTIVE  2016-10-22  2021-02-20
16   C82157435       INACTIVE  2021-02-20  2021-12-31
like image 120
Nk03 Avatar answered Oct 21 '22 03:10

Nk03