I am trying to insert fake data into this table. It can't be totally random because the rows would need to make sense. I'll explain below.
My data looks like this:
AcctID | account_status | start_date | end_date |
---|---|---|---|
C382861922 | ACTIVE | 2016-05-25 | None |
C382861922 | INACTIVE | None | None |
C382861922 | ACTIVE | None | None |
C382861922 | INACTIVE | None | 2021-12-31 |
C429768513 | ACTIVE | 2015-12-27 | None |
C429768513 | INACTIVE | None | None |
C429768513 | ACTIVE | None | None |
C429768513 | INACTIVE | None | None |
C429768513 | ACTIVE | None | None |
C429768513 | INACTIVE | None | None |
C429768513 | ACTIVE | None | None |
C429768513 | INACTIVE | None | 2021-12-31 |
C643625629 | ACTIVE | 2016-07-24 | None |
C643625629 | INACTIVE | None | None |
C643625629 | ACTIVE | None | 2021-12-31 |
C82157435 | ACTIVE | 2016-10-22 | None |
C82157435 | INACTIVE | None | 2021-12-31 |
Each AcctID can appear multiple times, but it's easiest to explain what I'm doing with just an example where the AcctID appears twice:
AcctID | account_status | start_date | end_date |
---|---|---|---|
C82157435 | ACTIVE | 2016-10-22 | None |
C82157435 | INACTIVE | None | 2021-12-31 |
My goal is to randomly pick a date on which this customer changed their account_status; that date becomes both the end_date of the first row and the start_date of the second row. So I only need to pick one random date and insert it in both places. Easy enough: I can take max() and min(), calculate the difference in days, and choose a random integer within that range.
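For the two-row case, the min/max-and-random-offset approach described above can be sketched like this (the dates are taken from the C82157435 example; the seed is only there for reproducibility):

```python
import numpy as np
import pandas as pd

# Known overall boundaries for the account (from the two-row example).
start = pd.Timestamp("2016-10-22")
end = pd.Timestamp("2021-12-31")

rng = np.random.default_rng(0)  # seeded so the sketch is reproducible

# Pick a random day offset strictly inside the (start, end) interval.
span_days = (end - start).days
change_date = start + pd.Timedelta(days=int(rng.integers(1, span_days)))

# The change date becomes row 1's end_date and row 2's start_date.
assert start < change_date < end
```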
However, I can't figure out how I'd do it for a customer with more than 2 records:
AcctID | account_status | start_date | end_date |
---|---|---|---|
C429768513 | ACTIVE | 2015-12-27 | None |
C429768513 | INACTIVE | None | None |
C429768513 | ACTIVE | None | None |
C429768513 | INACTIVE | None | None |
C429768513 | ACTIVE | None | None |
C429768513 | INACTIVE | None | None |
C429768513 | ACTIVE | None | None |
C429768513 | INACTIVE | None | 2021-12-31 |
There will be several places to choose a random date, but since they need to correspond to each other, the problem becomes really complex. Any ideas?
Here's code to create the sample dataframe:
import pandas as pd

# Eight alternating ACTIVE/INACTIVE rows for one account;
# "None" is the literal string placeholder, as in the table above.
fake = [
    {"AcctID": "C429768513", "account_status": status,
     "start_date": "None", "end_date": "None"}
    for status in ["ACTIVE", "INACTIVE"] * 4
]
fake[0]["start_date"] = "2015-12-27"  # first known date
fake[-1]["end_date"] = "2021-12-31"   # last known date
df = pd.DataFrame(fake)
Edit: Here's a fake example of what the program output could look like. Note that most of the dates are randomly chosen, but the end date of each row matches the start date of the next row.
AcctID | account_status | start_date | end_date |
---|---|---|---|
C429768513 | ACTIVE | 2015-12-27 | 2016-01-05 |
C429768513 | INACTIVE | 2016-01-05 | 2016-03-01 |
C429768513 | ACTIVE | 2016-03-01 | 2017-06-22 |
C429768513 | INACTIVE | 2017-06-22 | 2017-09-04 |
C429768513 | ACTIVE | 2017-09-04 | 2018-10-27 |
C429768513 | INACTIVE | 2018-10-27 | 2019-04-04 |
C429768513 | ACTIVE | 2019-04-04 | 2020-06-06 |
C429768513 | INACTIVE | 2020-06-06 | 2021-12-31 |
One way to solve this problem:
import numpy as np
import pandas as pd

# Replace the string "None" with real missing values.
df = df.replace(to_replace='None', value=np.nan)

def random_date(x):
    # Known boundary dates for this account: the first start and the last end.
    s_d = pd.to_datetime(x[x['start_date'].notna()]['start_date'])
    e_d = pd.to_datetime(x[x['end_date'].notna()]['end_date'])
    start_u = s_d.iloc[0].value // 10**9  # boundaries as Unix seconds
    end_u = e_d.iloc[0].value // 10**9
    # Draw len(x)-1 random timestamps between the boundaries and sort them,
    # so the change dates are in chronological order across the rows.
    end_date_list = sorted(pd.to_datetime(
        np.random.randint(start_u, end_u, len(x) - 1), unit='s').values)
    end_date_list = np.append(end_date_list, e_d.values)  # keep the real final end_date
    x['end_date'] = end_date_list
    # Each missing start_date becomes the previous row's end_date.
    mask = x['start_date'].isna()
    x.loc[mask, 'start_date'] = x.shift(1).loc[mask]['end_date'].astype(str)
    x['start_date'] = pd.to_datetime(x['start_date']).dt.date
    x['end_date'] = pd.to_datetime(x['end_date']).dt.date
    return x

df = df.groupby('AcctID').apply(random_date)
output:
AcctID account_status start_date end_date
0 C382861922 ACTIVE 2016-05-25 2016-12-23
1 C382861922 INACTIVE 2016-12-23 2017-12-28
2 C382861922 ACTIVE 2017-12-28 2019-04-24
3 C382861922 INACTIVE 2019-04-24 2021-12-31
4 C429768513 ACTIVE 2015-12-27 2017-12-04
5 C429768513 INACTIVE 2017-12-04 2019-01-07
6 C429768513 ACTIVE 2019-01-07 2019-04-03
7 C429768513 INACTIVE 2019-04-03 2020-06-13
8 C429768513 ACTIVE 2020-06-13 2021-02-13
9 C429768513 INACTIVE 2021-02-13 2021-03-09
10 C429768513 ACTIVE 2021-03-09 2021-08-09
11 C429768513 INACTIVE 2021-08-09 2021-12-31
12 C643625629 ACTIVE 2016-07-24 2021-02-27
13 C643625629 INACTIVE 2021-02-27 2021-05-20
14 C643625629 ACTIVE 2021-05-20 2021-12-31
15 C82157435 ACTIVE 2016-10-22 2021-02-20
16 C82157435 INACTIVE 2021-02-20 2021-12-31
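As a quick sanity check of the chaining invariant, here is a small sketch using the two C82157435 rows copied from the output above: within each account, every row's start_date must equal the previous row's end_date, and every interval must run forward in time.

```python
import pandas as pd

# Two rows copied from the output above for account C82157435.
out = pd.DataFrame({
    "AcctID": ["C82157435", "C82157435"],
    "account_status": ["ACTIVE", "INACTIVE"],
    "start_date": pd.to_datetime(["2016-10-22", "2021-02-20"]),
    "end_date": pd.to_datetime(["2021-02-20", "2021-12-31"]),
})

for _, g in out.groupby("AcctID"):
    # Each row's start_date equals the previous row's end_date...
    assert (g["start_date"].iloc[1:].values == g["end_date"].iloc[:-1].values).all()
    # ...and every interval runs forward in time.
    assert (g["end_date"] > g["start_date"]).all()

print("chaining OK")
```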