I have a DataFrame that consists of three columns, and I want to add "Yes" or "No" values to one of the columns using pandas. The ratio of Yes to No should be 7:3.
Has anyone tried this?
With numpy's random.choice:
df["new_column"] = np.random.choice(["Yes", "No"], len(df), p=[0.7, 0.3])
Note: np.random.choice draws independent trials, and in each trial the probability of getting a "Yes" is 0.7, so you might not end up with exactly a 70% ratio. (Passing replace=False does not help here: with only two values to choose from, it raises an error for any sample larger than two.) However, with 2480500 rows this binomial distribution is well approximated by a normal distribution with mean 2480500 * 0.7 and standard deviation sqrt(2480500 * 0.7 * 0.3). Within +/-3 standard deviations (99.73% probability) you will end up with a ratio between 0.69913 and 0.70087. If you want exactly 70%, you can use pandas' sample as @EdChum suggested: it selects a fixed number of rows (frac times the row count, rounded) without replacement.
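The interval quoted in the note can be checked with a few lines of arithmetic. This is only a sketch of that calculation, using the row count and probability stated above; the variable names are illustrative.

```python
import math

n = 2480500   # row count quoted in the note
p = 0.7       # probability of "Yes" in each independent trial

# Mean and standard deviation of the number of "Yes" values (binomial).
mean = n * p
std = math.sqrt(n * p * (1 - p))

# Convert the +/-3 standard deviation band on the count into a ratio.
low = (mean - 3 * std) / n
high = (mean + 3 * std) / n

print(f"mean={mean:.0f}, std={std:.1f}")
print(f"3-sigma ratio range: ({low:.5f}, {high:.5f})")
```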
You can use sample to achieve this:
In [11]:
df = pd.DataFrame(np.random.randn(20,3), columns=list('abc'))
df
Out[11]:
a b c
0 -0.267704 1.030417 -0.494542
1 -0.830801 0.421847 1.296952
2 -1.165387 -0.381976 -0.178988
3 -0.800799 -0.240998 -0.900573
4 0.855965 0.765313 -0.125862
5 1.153730 1.323783 -0.113135
6 0.242592 -2.137141 -0.230177
7 -0.451582 0.267415 1.006564
8 0.071916 0.476523 1.326859
9 -1.168084 0.250367 -1.235262
10 0.238183 0.391661 -1.177926
11 -1.153294 -0.304811 -0.955384
12 -0.984470 -0.351073 -1.155049
13 -2.068388 1.294905 0.892136
14 -0.196381 -1.083988 0.203369
15 -1.430208 0.859933 1.152462
16 -0.250452 0.824815 0.425096
17 1.051399 -1.199689 0.487980
18 0.688910 -0.664028 -0.097302
19 -0.355774 0.064857 0.003731
In [12]:
df.loc[df.index.to_series().sample(frac=0.7).index, 'new_col'] = 'Yes'
df['new_col'] = df['new_col'].fillna('No')
df
Out[12]:
a b c new_col
0 -0.267704 1.030417 -0.494542 Yes
1 -0.830801 0.421847 1.296952 Yes
2 -1.165387 -0.381976 -0.178988 No
3 -0.800799 -0.240998 -0.900573 No
4 0.855965 0.765313 -0.125862 No
5 1.153730 1.323783 -0.113135 Yes
6 0.242592 -2.137141 -0.230177 Yes
7 -0.451582 0.267415 1.006564 Yes
8 0.071916 0.476523 1.326859 No
9 -1.168084 0.250367 -1.235262 Yes
10 0.238183 0.391661 -1.177926 Yes
11 -1.153294 -0.304811 -0.955384 Yes
12 -0.984470 -0.351073 -1.155049 Yes
13 -2.068388 1.294905 0.892136 Yes
14 -0.196381 -1.083988 0.203369 No
15 -1.430208 0.859933 1.152462 Yes
16 -0.250452 0.824815 0.425096 Yes
17 1.051399 -1.199689 0.487980 Yes
18 0.688910 -0.664028 -0.097302 Yes
19 -0.355774 0.064857 0.003731 No
Basically, you call sample with frac=0.7, use the resulting index to select rows of the df and assign the 'Yes' value, and then call fillna to assign the 'No' value to the remaining rows.
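The same idea can be wrapped in a small helper. The sketch below is one possible variant, not the answer's own code: the function name add_yes_no and its parameters are hypothetical, it uses np.where instead of fillna, and it assumes the DataFrame index is unique.

```python
import numpy as np
import pandas as pd

def add_yes_no(df, col="new_col", frac=0.7, seed=None):
    """Return a copy of df where a random `frac` of rows get 'Yes' in `col`
    and all other rows get 'No'. Assumes the index has no duplicates."""
    out = df.copy()
    # sample() picks a fixed number of rows without replacement.
    yes_idx = out.sample(frac=frac, random_state=seed).index
    out[col] = np.where(out.index.isin(yes_idx), "Yes", "No")
    return out

df = pd.DataFrame(np.random.randn(20, 3), columns=list("abc"))
result = add_yes_no(df, seed=0)
print(result["new_col"].value_counts())
```

With 20 rows and frac=0.7, exactly 14 rows are labelled 'Yes' and 6 are labelled 'No'; only which rows are chosen varies with the seed.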
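import random
import pandas as pd

# Works only when the number of rows is a multiple of 10;
# otherwise the list length will not match the DataFrame.
number_of_rows = len(df)
arr = ['Yes'] * 7 + ['No'] * 3   # one block of 10 labels in a 7:3 ratio
arr *= number_of_rows // 10      # repeat the block to cover every row
random.shuffle(arr)              # shuffle in place
df['column_name'] = arr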
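The shuffle approach above needs a row count that is a multiple of 10. A possible variation for arbitrary lengths is to compute the number of 'Yes' labels directly; the 23-row DataFrame and the rounding rule here are assumptions made for illustration.

```python
import random
import pandas as pd

# Hypothetical example frame whose length is not a multiple of 10.
df = pd.DataFrame({"a": range(23)})

# Round 70% of the row count to the nearest whole number of 'Yes' labels.
n_yes = round(len(df) * 0.7)
labels = ["Yes"] * n_yes + ["No"] * (len(df) - n_yes)

random.shuffle(labels)        # shuffle in place
df["column_name"] = labels
print(df["column_name"].value_counts())
```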
pd.Series(np.random.rand(100)).apply(lambda x: 'Yes' if x < .7 else 'No')