Column with random, increasing numbers in pandas

Question

I have a dataframe of about 10,000 rows. Each record includes a Recordid and an integer that represents the number of days since the start for that RecordID. For example:

 df = pd.DataFrame( { 
"RecordID" : ["id1", "id2", "id3", "id4", "id5" , "id1", "id2", "id3", "id4", "id5" ] , 
"number_of_days" : [1,1,1,1,1,2,2,2,2,2]})
df
RecordID    number_of_days
0   id1     1
1   id2     1
2   id3     1
3   id4     1
4   id5     1
5   id1     2
6   id2     2
7   id3     2
8   id4     2
9   id5     2

I'd like to add a column that contains a random number between 1 and 100. But I need the random to be higher than the previous random value for each ID. This is an example of what I would like to get:

RecordID    number_of_days  random_value
0   id1     1               10
1   id1     2               13
2   id1     3               45
3   id1     4               50
4   id1     5               62
5   id1     6               80
6   id1     7               81
7   id1     8               82
8   id1     9               92
9   id1     10              99
10  id2     2               12
11  id2     4               31

I see posts about creating a field with random values. I'm not finding any that address the need for generating random values that increase, though.

Vaishali · Accepted Answer

You can generate random numbers, sort and then assign to df,

df = pd.DataFrame( { 

"RecordID" : ["id1", "id2", "id3", "id4", "id5" , "id1", "id2", "id3", "id4", "id5" ] , 
"number_of_days" : [1,1,1,1,1,2,2,2,2,2]})

df['random_value'] = np.sort(np.random.randint(1,100, len(df)))


    RecordID    number_of_days  random_value
0   id1         1               5
1   id2         1               7
2   id3         1               19
3   id4         1               34
4   id5         1               45
5   id1         2               53
6   id2         2               67
7   id3         2               72
8   id4         2               72
9   id5         2               80

Edit: If you want the random_value by group, you can group data by ID and then assign sorted random numbers,

df.groupby('RecordID').apply(lambda x: pd.Series(np.sort(np.random.randint(1,100, len(x))))).reset_index(name = 'random_value')

ALollz · Answer

Generate all random numbers, slice it properly based on the group sizes, sort each slice, and assign back. First we need to sort the DataFrame so that assignment occurs properly.

import numpy as np
import pandas as pd

df = df.sort_values('RecordID')

arr = np.array_split(np.random.randint(1, 100, len(df)),
                     df.groupby('RecordID').size().cumsum()[:-1])

df['Random_Value'] = np.sort(arr, axis=1).ravel()

Output

  RecordID  number_of_days  Random_Value
0      id1               1            19
5      id1               2            41
1      id2               1            53
6      id2               2            56
2      id3               1            33
7      id3               2            68
3      id4               1            57
8      id4               2            67
4      id5               1            39
9      id5               2            49

As always, it's best to avoid groupby.apply(lambda x: ... as this is a slow loop over the groups.

N = 10000
df = pd.DataFrame({"RecordID": list(range(N))*10,
                   "number_of_days": np.repeat(range(10), N)})

def ALollz(df):
    df = df.sort_values(['RecordID', 'number_of_days'])

    arr = np.array_split(np.random.randint(1, 100, len(df)),
                         df.groupby('RecordID').size().cumsum()[:-1])

    df['Random_Value'] = np.sort(arr, axis=1).ravel()

    return df

%timeit ALollz(df)
#54 ms ± 1.64 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit df.assign(random_value=df.groupby('RecordID').transform(lambda x: np.sort(np.random.randint(1,100, len(x))))).sort_values('RecordID')
#15.9 s ± 124 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit df.groupby('RecordID').apply(lambda x: pd.Series(np.sort(np.random.randint(1,100, len(x))))).reset_index()
#1.23 s ± 25.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Column with random, increasing numbers in pandas

Tags:

random

pandas

dataframe

JamesMiller

2 Answers

Vaishali

Output

ALollz

Recent Activity

Donate For Us

Column with random, increasing numbers in pandas

Tags:

random

pandas

dataframe

JamesMiller

2 Answers

Vaishali

Output

ALollz

Related questions

Recent Activity

Donate For Us