I am trying to randomly assign values from one column of one dataframe to another dataframe, within 12 different categories (by agerange and gender). For example, I have two dataframes; let's call one d1 and the other d2.
d1:
index  agerange  gender    income
    0         2       1     56700
    1         2       0     25600
    2         4       0      3000
    3         4       0    106000
    4         3       0       200
    5         3       0     43000
    6         4       0  10000000
d2:
index  agerange  gender
    0         3       0
    1         2       0
    2         4       0
    3         4       0
I want to group both dataframes by agerange and gender (i.e. gender 0 with ageranges 1-6 and gender 1 with ageranges 1-6), then, for each row of d2, randomly choose one of the incomes from the matching group in d1 and assign it to that row.
For example:
d1:
index  agerange  gender    income
    0         2       1     56700
    1         2       0     25600
    2         4       0      3000
    3         4       0    106000
    4         3       0       200
    5         3       0     43000
    6         4       0  10000000
d2:
index  agerange  gender    income
    0         3       0       200
    1         2       0     25600
    2         4       0  10000000
    3         4       0      3000
A random selection of rows from a DataFrame can be obtained in several ways. For example, create a simple DataFrame from a dictionary of lists and call its sample method, which returns a random sample of items from an axis of the object; the result has the same type as the caller.
Generate Random Integers under a Single DataFrame Column
Here is a template that you may use to generate random integers under a single DataFrame column (the placeholders are to be filled in; note that the upper bound of np.random.randint is exclusive):
import numpy as np
import pandas as pd
data = np.random.randint(lowest integer, highest integer, size=number of random integers)
df = pd.DataFrame(data, columns=['column name'])
print(df)
With a fixed random_state, sample always fetches the same rows from a given DataFrame. If random_state is None, a randomly initialized generator is used, so the rows differ from call to call. NumPy chooses which index labels are included in the selection, and passing replace=True allows the same row to be drawn more than once.
Example 2: Use the parameter n to select n rows at random, with sample(n) or sample(n=n).
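As a small illustration of the sample behaviour described above, here is a sketch on a made-up DataFrame (the column names and the seed 0 are arbitrary choices of mine, not from the original text). It shows that a fixed random_state repeats the same draw and that replace=True allows n to exceed the number of rows.

```python
import pandas as pd

# Toy data, invented for this example only.
df = pd.DataFrame({"name": ["a", "b", "c", "d", "e"], "score": [1, 2, 3, 4, 5]})

# Draw 3 rows at random; a fixed random_state makes the draw repeatable.
first = df.sample(n=3, random_state=0)
second = df.sample(n=3, random_state=0)

# Sampling with replacement lets n be larger than the number of rows.
bigger = df.sample(n=8, replace=True, random_state=0)

print(first)
print(len(bigger))
```

Without random_state, two calls would generally return different rows.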
Option 1
An approach with np.random.choice and pd.DataFrame.query.
I'm making the implicit assumption that values are drawn with replacement: every row of d2 draws independently, so the same income can be assigned more than once.
def take_one(x):
    q = 'agerange == {agerange} and gender == {gender}'.format(**x)
    return np.random.choice(d1.query(q).income)

d2.assign(income=d2.apply(take_one, 1))

       agerange  gender  income
index
0             3       0     200
1             2       0   25600
2             4       0  106000
3             4       0  106000
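For anyone who wants to run the row-by-row idea end to end, here is a self-contained sketch. It is a variant rather than the answer's exact code: it swaps query for a boolean mask and np.random.choice for a seeded np.random.default_rng generator (both are my substitutions), and it returns NaN when d1 has no matching group, because choice raises an error on an empty pool. The helper name take_one_or_nan and the seed are hypothetical.

```python
import numpy as np
import pandas as pd

d1 = pd.DataFrame({
    "agerange": [2, 2, 4, 4, 3, 3, 4],
    "gender": [1, 0, 0, 0, 0, 0, 0],
    "income": [56700, 25600, 3000, 106000, 200, 43000, 10000000],
})
d2 = pd.DataFrame({"agerange": [3, 2, 4, 4], "gender": [0, 0, 0, 0]})

rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability

def take_one_or_nan(row):
    # Incomes in d1 that share this row's agerange and gender.
    mask = (d1["agerange"] == row["agerange"]) & (d1["gender"] == row["gender"])
    pool = d1.loc[mask, "income"].to_numpy()
    # choice raises on an empty pool, so fall back to NaN (an assumption).
    return rng.choice(pool) if len(pool) else np.nan

result = d2.assign(income=d2.apply(take_one_or_nan, axis=1))
print(result)
```

Each row triggers its own filter of d1, which is simple but scales with len(d1) times len(d2).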
Option 2
An attempt to make it more efficient by calling np.random.choice once per group.
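# One list of incomes per (agerange, gender) group in d1
g = d1.groupby(['agerange', 'gender']).income.apply(list)
# Draw len(x) incomes per d2 group; groups missing from d1 fall back to 0
f = lambda x: pd.Series(np.random.choice(g.get(x.name, [0] * len(x)), len(x)), x.index)
d2.assign(income=d2.groupby(['agerange', 'gender'], group_keys=False).apply(f))

       agerange  gender    income
index
0             3       0       200
1             2       0     25600
2             4       0  10000000
3             4       0    106000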
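The once-per-group idea can also be written as an explicit loop, which makes the handling of missing groups easier to see. The following is a sketch under my own assumptions, not the answer's code: d2 is extended with a made-up row (agerange 5, gender 1) that has no counterpart in d1, unmatched rows get NaN instead of the 0 fallback used above, and a seeded np.random.default_rng generator replaces np.random.choice.

```python
import numpy as np
import pandas as pd

d1 = pd.DataFrame({
    "agerange": [2, 2, 4, 4, 3, 3, 4],
    "gender": [1, 0, 0, 0, 0, 0, 0],
    "income": [56700, 25600, 3000, 106000, 200, 43000, 10000000],
})
# The last row (5, 1) is invented here and has no matching group in d1.
d2 = pd.DataFrame({"agerange": [3, 2, 4, 4, 5], "gender": [0, 0, 0, 0, 1]})

rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability

# Map each (agerange, gender) key in d1 to the array of its incomes.
pools = {key: grp.to_numpy()
         for key, grp in d1.groupby(["agerange", "gender"])["income"]}

income = pd.Series(np.nan, index=d2.index)
for key, rows in d2.groupby(["agerange", "gender"]):
    if key in pools:
        # One call per group: draw len(rows) incomes with replacement.
        income.loc[rows.index] = rng.choice(pools[key], size=len(rows))

result = d2.assign(income=income)
print(result)
```

Because NaN is a float, the income column comes back as float64 here; whether NaN or 0 is the better fallback depends on how the result is used downstream.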
Debugging and Setup
import pandas as pd
import numpy as np
d1 = pd.DataFrame({
    'agerange': [2, 2, 4, 4, 3, 3, 4],
    'gender': [1, 0, 0, 0, 0, 0, 0],
    'income': [56700, 25600, 3000, 106000, 200, 43000, 10000000]
}, pd.Index([0, 1, 2, 3, 4, 5, 6], name='index')
)
d2 = pd.DataFrame(
    {'agerange': [3, 2, 4, 4], 'gender': [0, 0, 0, 0]},
    pd.Index([0, 1, 2, 3], name='index')
)
g = d1.groupby(['agerange', 'gender']).income.apply(list)
f = lambda x: pd.Series(np.random.choice(g.loc[x.name], len(x)), x.index)
d2.assign(income=d2.groupby(['agerange', 'gender'], group_keys=False).apply(f))
       agerange  gender  income
index
0             3       0     200
1             2       0   25600
2             4       0  106000
3             4       0    3000