I have a dataframe of about 10,000 rows. Each record includes a RecordID and an integer that represents the number of days since the start for that RecordID. For example:
import pandas as pd

df = pd.DataFrame({
    "RecordID": ["id1", "id2", "id3", "id4", "id5", "id1", "id2", "id3", "id4", "id5"],
    "number_of_days": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2]})
df
RecordID number_of_days
0 id1 1
1 id2 1
2 id3 1
3 id4 1
4 id5 1
5 id1 2
6 id2 2
7 id3 2
8 id4 2
9 id5 2
I'd like to add a column that contains a random number between 1 and 100, but each random value needs to be higher than the previous random value for the same RecordID. This is an example of what I would like to get:
RecordID number_of_days random_value
0 id1 1 10
1 id1 2 13
2 id1 3 45
3 id1 4 50
4 id1 5 62
5 id1 6 80
6 id1 7 81
7 id1 8 82
8 id1 9 92
9 id1 10 99
10 id2 2 12
11 id2 4 31
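A quick way to verify that requirement on a candidate result is to check that the values are strictly increasing within each ID. A minimal sketch, assuming the frame is sorted by number_of_days within each RecordID:

# Check that random_value is strictly increasing within each RecordID
# (assumes rows are ordered by number_of_days within each ID)
strictly_increasing = (
    df.groupby("RecordID")["random_value"]
      .apply(lambda s: s.diff().dropna().gt(0).all())
)
print(strictly_increasing.all())  # True only if every ID satisfies it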
I see posts about creating a field with random values, but I'm not finding any that address generating random values that increase.
You can generate random numbers, sort them, and then assign them to the DataFrame:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "RecordID": ["id1", "id2", "id3", "id4", "id5", "id1", "id2", "id3", "id4", "id5"],
    "number_of_days": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2]})

# randint's upper bound is exclusive, so 101 makes the values range over 1-100
df['random_value'] = np.sort(np.random.randint(1, 101, len(df)))
RecordID number_of_days random_value
0 id1 1 5
1 id2 1 7
2 id3 1 19
3 id4 1 34
4 id5 1 45
5 id1 2 53
6 id2 2 67
7 id3 2 72
8 id4 2 72
9 id5 2 80
Edit: If you want the random_value sorted within each group, you can group the data by RecordID and then assign sorted random numbers:
# apply returns one Series per group; stack them into a single column
# before reset_index, since DataFrame.reset_index has no name argument
(df.groupby('RecordID')
   .apply(lambda x: pd.Series(np.sort(np.random.randint(1, 101, len(x)))))
   .stack()
   .reset_index(name='random_value'))
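If you would rather keep the result aligned with the original frame and just add a column, a groupby transform does that directly. A short sketch, assuming rows are sorted by number_of_days within each RecordID (this is the same transform timed in the benchmarks below):

df = df.sort_values(['RecordID', 'number_of_days'])
# transform returns one value per row, aligned to df's index,
# so the sorted draws land on the right rows of each group
df['random_value'] = (
    df.groupby('RecordID')['number_of_days']
      .transform(lambda x: np.sort(np.random.randint(1, 101, len(x))))
)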
Generate all the random numbers at once, split them based on the group sizes, sort each slice, and assign back. First we need to sort the DataFrame so that the assignment lines up properly.
import numpy as np
import pandas as pd

# Sort so rows are grouped by ID, in day order within each ID
df = df.sort_values(['RecordID', 'number_of_days'])

# Draw every random number in one call, then split into one slice per ID
arr = np.array_split(np.random.randint(1, 101, len(df)),
                     df.groupby('RecordID').size().cumsum()[:-1])

# Sorting with axis=1 assumes every ID has the same number of rows
df['Random_Value'] = np.sort(arr, axis=1).ravel()
RecordID number_of_days Random_Value
0 id1 1 19
5 id1 2 41
1 id2 1 53
6 id2 2 56
2 id3 1 33
7 id3 2 68
3 id4 1 57
8 id4 2 67
4 id5 1 39
9 id5 2 49
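Note that np.sort(arr, axis=1) only works when every ID has the same number of rows. If the group sizes can differ, sorting each slice on its own avoids that restriction; a minimal sketch under the same setup:

df = df.sort_values(['RecordID', 'number_of_days'])
sizes = df.groupby('RecordID').size()
arr = np.array_split(np.random.randint(1, 101, len(df)), sizes.cumsum()[:-1])
# Sort each per-ID slice independently, then stitch back into one column
df['Random_Value'] = np.concatenate([np.sort(a) for a in arr])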
As always, it's best to avoid groupby.apply(lambda x: ...), as this is a slow loop over the groups.
N = 10000
df = pd.DataFrame({"RecordID": list(range(N))*10,
                   "number_of_days": np.repeat(range(10), N)})

def ALollz(df):
    df = df.sort_values(['RecordID', 'number_of_days'])
    arr = np.array_split(np.random.randint(1, 101, len(df)),
                         df.groupby('RecordID').size().cumsum()[:-1])
    df['Random_Value'] = np.sort(arr, axis=1).ravel()
    return df
%timeit ALollz(df)
#54 ms ± 1.64 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df.assign(random_value=df.groupby('RecordID').transform(lambda x: np.sort(np.random.randint(1, 101, len(x))))).sort_values('RecordID')
#15.9 s ± 124 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df.groupby('RecordID').apply(lambda x: pd.Series(np.sort(np.random.randint(1, 101, len(x))))).reset_index()
#1.23 s ± 25.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
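A final note on randomness: np.random.randint draws from NumPy's legacy global state. If you need reproducible output, the same split-and-sort approach works with the newer Generator API; a sketch, with 42 as an arbitrary seed:

import numpy as np

rng = np.random.default_rng(42)  # seeded Generator for reproducible draws

df = df.sort_values(['RecordID', 'number_of_days'])
sizes = df.groupby('RecordID').size()
# Generator.integers also uses an exclusive upper bound, so 101 includes 100
arr = np.array_split(rng.integers(1, 101, len(df)), sizes.cumsum()[:-1])
df['Random_Value'] = np.concatenate([np.sort(a) for a in arr])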