
How to randomly append "Yes/No" (ratio of 7:3) to a column in pandas dataframe?

I have a dataframe with three columns, and I want to fill one of the columns with "Yes" or "No" at random using pandas. The ratio of Yes:No should be 7:3.

Has anyone tried this?

Chandu codes asked May 19 '16 18:05



4 Answers

With numpy's random.choice:

df["new_column"] = np.random.choice(["Yes", "No"], len(df), p=[0.7, 0.3])

Note: np.random.choice draws each value independently, so every row has a 0.7 probability of getting "Yes" and the final split will usually not be exactly 70%. With 2480500 rows, the number of "Yes" values follows a binomial distribution that is well approximated by a normal distribution with mean 2480500 * 0.7 and standard deviation sqrt(2480500 * 0.7 * 0.3). Within +/-3 standard deviations (99.73% probability), the share of "Yes" lands in (0.69913, 0.70087). If you want exactly 70%, use pandas' sample as @EdChum suggested: it selects a fixed number of rows without replacement.
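
The arithmetic in the note can be checked with a short sketch. The variable names and the seed below are illustrative assumptions, not part of the original answer, and the simulated share will differ from run to run if the seed is changed.

```python
import math
import numpy as np

n = 2480500   # row count used in the note above
p = 0.7       # probability of "Yes" in each independent trial

mean = n * p
std = math.sqrt(n * p * (1 - p))

# Share of "Yes" at 3 standard deviations below and above the mean.
low = (mean - 3 * std) / n
high = (mean + 3 * std) / n
print(round(low, 5), round(high, 5))

# One simulated draw; the seed is an arbitrary choice for reproducibility.
rng = np.random.default_rng(0)
labels = rng.choice(["Yes", "No"], size=n, p=[p, 1 - p])
share = (labels == "Yes").mean()
print(share)
```

The printed share should be close to 0.7 but, in general, not exactly 0.7, which is the point of the note.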

ayhan answered Oct 14 '22 07:10


You can use sample to achieve this:

In [11]:
df = pd.DataFrame(np.random.randn(20,3), columns=list('abc'))
df

Out[11]:
           a         b         c
0  -0.267704  1.030417 -0.494542
1  -0.830801  0.421847  1.296952
2  -1.165387 -0.381976 -0.178988
3  -0.800799 -0.240998 -0.900573
4   0.855965  0.765313 -0.125862
5   1.153730  1.323783 -0.113135
6   0.242592 -2.137141 -0.230177
7  -0.451582  0.267415  1.006564
8   0.071916  0.476523  1.326859
9  -1.168084  0.250367 -1.235262
10  0.238183  0.391661 -1.177926
11 -1.153294 -0.304811 -0.955384
12 -0.984470 -0.351073 -1.155049
13 -2.068388  1.294905  0.892136
14 -0.196381 -1.083988  0.203369
15 -1.430208  0.859933  1.152462
16 -0.250452  0.824815  0.425096
17  1.051399 -1.199689  0.487980
18  0.688910 -0.664028 -0.097302
19 -0.355774  0.064857  0.003731

In [12]:    
df.loc[df.index.to_series().sample(frac=0.7).index, 'new_col'] = 'Yes'
df['new_col'] = df['new_col'].fillna('No')  # assign back; inplace=True on a selected column is unreliable in recent pandas
df

Out[12]:
           a         b         c new_col
0  -0.267704  1.030417 -0.494542     Yes
1  -0.830801  0.421847  1.296952     Yes
2  -1.165387 -0.381976 -0.178988      No
3  -0.800799 -0.240998 -0.900573      No
4   0.855965  0.765313 -0.125862      No
5   1.153730  1.323783 -0.113135     Yes
6   0.242592 -2.137141 -0.230177     Yes
7  -0.451582  0.267415  1.006564     Yes
8   0.071916  0.476523  1.326859      No
9  -1.168084  0.250367 -1.235262     Yes
10  0.238183  0.391661 -1.177926     Yes
11 -1.153294 -0.304811 -0.955384     Yes
12 -0.984470 -0.351073 -1.155049     Yes
13 -2.068388  1.294905  0.892136     Yes
14 -0.196381 -1.083988  0.203369      No
15 -1.430208  0.859933  1.152462     Yes
16 -0.250452  0.824815  0.425096     Yes
17  1.051399 -1.199689  0.487980     Yes
18  0.688910 -0.664028 -0.097302     Yes
19 -0.355774  0.064857  0.003731      No

In short, call sample with frac=0.7 to pick 70% of the rows, use the resulting index to assign 'Yes' to those rows, and then call fillna to set the remaining rows to 'No'.
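
A minimal sketch of the same idea as a standalone script, assuming a 20-row frame like the one above. Here the column is pre-filled with 'No' instead of using fillna, which is a variation on the answer and not the answer's own code; it also checks that the split is exact.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(20, 3), columns=list("abc"))

# Pick exactly 70% of the row labels (14 of 20) without replacement.
yes_index = df.sample(frac=0.7).index

# Default every row to "No", then overwrite the sampled rows.
df["new_col"] = "No"
df.loc[yes_index, "new_col"] = "Yes"

counts = df["new_col"].value_counts()
print(counts)
```

Which rows get 'Yes' changes on every run, but the counts do not: sample returns a fixed number of rows for a given frac.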

EdChum answered Oct 14 '22 06:10


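import random
import pandas as pd

# df is the existing DataFrame
number_of_rows = len(df)
# Round so the list length always matches, even when the row count is not a multiple of 10
n_yes = round(number_of_rows * 0.7)
arr = ['Yes'] * n_yes + ['No'] * (number_of_rows - n_yes)

random.shuffle(arr)

df['column_name'] = arr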

Vedang Mehta answered Oct 14 '22 07:10


Quick and Dirty

pd.Series(np.random.rand(100)).apply(lambda x: 'Yes' if x < .7 else 'No')
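
The one-liner above returns a new Series and does not attach it to a dataframe. A possible vectorized variant is sketched below, using np.where in place of apply; the frame and column names are hypothetical. As with np.random.choice, each row is an independent draw, so the split is only approximately 7:3.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": range(100)})  # hypothetical frame for illustration

# One uniform draw in [0, 1) per row; values below 0.7 become "Yes".
draws = np.random.rand(len(df))
df["new_col"] = np.where(draws < 0.7, "Yes", "No")

print(df["new_col"].value_counts())
```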

piRSquared answered Oct 14 '22 06:10