
How to assign values randomly between dataframes

Tags: python, pandas

I am trying to randomly assign values from a column in one dataframe to another dataframe, within 12 different categories (by agerange and gender). For example, I have two dataframes; let's call one d1 and the other d2:

  d1:
index agerange gender income
 0     2        1      56700
 1     2        0      25600
 2     4        0      3000
 3     4        0      106000
 4     3        0      200
 5     3        0      43000
 6     4        0      10000000

d2:
index agerange gender 
 0     3        0      
 1     2        0      
 2     4        0      
 3     4        0      

I want to group both dataframes by agerange and gender (i.e. gender 0 with ageranges 1 to 6, and gender 1 with ageranges 1 to 6), then randomly choose one of the incomes in the matching d1 group and assign it to each row of d2.

For example:

d1:
index agerange gender income
 0     2        1      56700
 1     2        0      25600
 2     4        0      3000
 3     4        0      106000
 4     3        0      200
 5     3        0      43000
 6     4        0      10000000

d2:
index agerange gender  income
 0     3        0      200  
 1     2        0      25600 
 2     4        0      10000000
 3     4        0      3000

asked Jul 31 '17 16:07 by stav


1 Answer

Option 1
An approach with np.random.choice and pd.DataFrame.query
I'm implicitly assuming that values are drawn with replacement: every row of d2 gets its own independent random draw, so the same income can be assigned more than once.

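def take_one(x):
    # Build a query matching this row's agerange and gender
    q = 'agerange == {agerange} and gender == {gender}'.format(**x)
    # Pick one income at random from the matching rows of d1
    return np.random.choice(d1.query(q).income)

# Apply row by row (axis=1) and attach the result as a new column
d2.assign(income=d2.apply(take_one, axis=1))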

       agerange  gender  income
index                          
0             3       0     200
1             2       0   25600
2             4       0  106000
3             4       0  106000
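
Because the draws are random, the output above changes from run to run. For a repeatable run of the Option 1 idea, a seeded generator can be used. This is only a minimal sketch, assuming NumPy's np.random.default_rng in place of the global np.random.choice and a boolean mask in place of query; the seed 0 and the names rng, pool and result are arbitrary choices for illustration and are not part of the original answer.

```python
import numpy as np
import pandas as pd

d1 = pd.DataFrame({
    'agerange': [2, 2, 4, 4, 3, 3, 4],
    'gender': [1, 0, 0, 0, 0, 0, 0],
    'income': [56700, 25600, 3000, 106000, 200, 43000, 10000000],
})
d2 = pd.DataFrame({'agerange': [3, 2, 4, 4], 'gender': [0, 0, 0, 0]})

# Seeded generator so repeated runs give the same draws (seed is arbitrary)
rng = np.random.default_rng(0)

def take_one(row):
    # Incomes in d1 that share this row's agerange and gender
    mask = (d1['agerange'] == row['agerange']) & (d1['gender'] == row['gender'])
    pool = d1.loc[mask, 'income'].to_numpy()
    # Raises ValueError if no row of d1 matches (empty pool)
    return rng.choice(pool)

result = d2.assign(income=d2.apply(take_one, axis=1))
print(result)
```

Each row is still drawn independently, so two rows in the same group may receive the same income.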

Option 2
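A more efficient attempt that calls np.random.choice once per group instead of once per row. Groups of d2 that have no match in d1 draw from a list of zeros.

# Collect the incomes of each (agerange, gender) group in d1 into a list
g = d1.groupby(['agerange', 'gender']).income.apply(list)
# One np.random.choice call per d2 group; unmatched groups fall back to zeros
f = lambda x: pd.Series(np.random.choice(g.get(x.name, [0] * len(x)), len(x)), x.index)
d2.assign(income=d2.groupby(['agerange', 'gender'], group_keys=False).apply(f))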

       agerange  gender    income
index                            
0             3       0       200
1             2       0     25600
2             4       0  10000000
3             4       0    106000
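
Both options above draw with replacement. If rows in the same group should instead receive distinct incomes where possible, one possible variation is to draw without replacement per group. The sketch below is an assumption about what might be wanted, not something the question or the answer states; it uses an explicit loop over the groups of d2 rather than groupby.apply, falls back to drawing with replacement when a d2 group has more rows than d1 has incomes for it, and leaves NaN for groups with no match in d1. The names pools, income and result, and the seed, are illustrative.

```python
import numpy as np
import pandas as pd

d1 = pd.DataFrame({
    'agerange': [2, 2, 4, 4, 3, 3, 4],
    'gender': [1, 0, 0, 0, 0, 0, 0],
    'income': [56700, 25600, 3000, 106000, 200, 43000, 10000000],
})
d2 = pd.DataFrame({'agerange': [3, 2, 4, 4], 'gender': [0, 0, 0, 0]})

keys = ['agerange', 'gender']
rng = np.random.default_rng(0)  # arbitrary seed, for repeatable draws

# Map each (agerange, gender) pair in d1 to an array of its incomes
pools = {key: grp.to_numpy() for key, grp in d1.groupby(keys)['income']}

# Float column so that NaN can mark d2 groups that have no match in d1
income = pd.Series(np.nan, index=d2.index)

for key, idx in d2.groupby(keys).groups.items():
    pool = pools.get(key)
    if pool is None:
        continue  # no incomes available for this group
    # Only allow repeats when the group needs more values than the pool holds
    replace = len(idx) > len(pool)
    income.loc[idx] = rng.choice(pool, size=len(idx), replace=replace)

result = d2.assign(income=income)
print(result)
```

With the sample data, the two rows with agerange 4 and gender 0 draw from three available incomes, so they always end up with different values.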

Debugging and Setup

import pandas as pd
import numpy as np

d1 = pd.DataFrame({
        'agerange': [2, 2, 4, 4, 3, 3, 4],
        'gender': [1, 0, 0, 0, 0, 0, 0],
        'income': [56700, 25600, 3000, 106000, 200, 43000, 10000000]
    }, pd.Index([0, 1, 2, 3, 4, 5, 6], name='index')
)

d2 = pd.DataFrame(
    {'agerange': [3, 2, 4, 4], 'gender': [0, 0, 0, 0]},
    pd.Index([0, 1, 2, 3], name='index')
)

g = d1.groupby(['agerange', 'gender']).income.apply(list)
f = lambda x: pd.Series(np.random.choice(g.loc[x.name], len(x)), x.index)
d2.assign(income=d2.groupby(['agerange', 'gender'], group_keys=False).apply(f))

       agerange  gender  income
index                          
0             3       0     200
1             2       0   25600
2             4       0  106000
3             4       0    3000

answered Oct 21 '22 11:10 by piRSquared