I am trying to convert all values in my dataset into categorical values. I want every numerical value to be categorized as low, average, or high, depending on where it falls relative to the quantiles of its series.
So if a value is below the 25% quantile of the series, it should be converted to "Low".
I tried using assign and then applying a function that I wrote:
def turn_into_categorical(row):
    quantile_level = [.25, .5, .75]
    for r in row:
        cut = refugees_T_F_V_P_full_data.r.quantile(quantile_level)
        if r >= cut[.75]:
            return "High"
        elif r >= cut[.25] and r < cut[0.75]:
            return "Average"
        else:
            return "Low"

refugees_T_F_V_P_full_data.apply(turn_into_categorical, axis = 1)
However, the code doesn't work properly. I have also tried iterrows, but I wonder whether there is a faster way to do it.
Here is the data that I want to convert. Every number except Year and Month should be categorized as Low, Medium, or High, depending on its quantile value.
    Year  Month  Central Equatoria  Eastern Equatoria  Gogrial  Jonglei
0   2014  10     6.0                1.0                0.0      3.0
1   2014  11     4.0                3.0                0.0      12.0
2   2014  12     3.0                5.0                0.0      11.0
3   2015  1      7.0                2.0                0.0      4.0
4   2015  2      5.0                5.0                0.0      10.0
5   2015  3      7.0                5.0                0.0      8.0
6   2015  4      4.0                1.0                0.0      6.0
7   2015  5      5.0                0.0                0.0      7.0
8   2015  6      4.0                1.0                0.0      6.0
9   2015  7      15.0               2.0                0.0      9.0
10  2015  8      10.0               7.0                0.0      9.0
11  2015  9      12.0               0.0                0.0      8.0
12  2015  10     12.0               0.0                0.0      5.0
13  2015  11     8.0                5.0                0.0      10.0
14  2015  12     5.0                7.0                0.0      3.0
Expected Outcome: (Example)
    Year  Month  Central Equatoria  Eastern Equatoria  Gogrial  Jonglei
0   2014  10     High               Medium             Low      Medium
1   2014  11     Low                Medium             Low      High
Looks like you want pd.qcut, which does precisely that. From the docs:
Quantile-based discretization function
So you could apply pd.qcut along the columns of the dataframe from Central Equatoria onwards, specifying the quantiles used to bin each Series with q = [0, 0.25, 0.75, 1.0]:
# Columns with a single unique value (such as Gogrial) cannot be binned, so label them 'low'.
df.loc[:, 'Central Equatoria':].apply(
    lambda x: pd.qcut(x, q=[0, 0.25, 0.75, 1.0], labels=['low', 'medium', 'high'])
    if x.nunique() != 1 else 'low')
Output
    Central Equatoria  Eastern Equatoria  Gogrial  Jonglei
0   medium             low                low      low
1   low                medium             low      high
2   low                medium             low      high
3   medium             medium             low      low
4   medium             medium             low      high
5   medium             medium             low      medium
6   low                low                low      medium
7   medium             low                low      medium
8   low                low                low      medium
9   high               medium             low      medium
10  high               high               low      medium
11  high               low                low      medium
12  high               low                low      low
13  medium             medium             low      high
14  medium             high               low      low
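For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same qcut idea on three columns of the sample data. The helper name to_levels is my own, and converting the result to plain strings is an assumption made only so that every column ends up with the same dtype.

```python
import pandas as pd

# Three columns of the sample data from the question; Gogrial is constant.
df = pd.DataFrame({
    "Central Equatoria": [6, 4, 3, 7, 5, 7, 4, 5, 4, 15, 10, 12, 12, 8, 5],
    "Jonglei": [3, 12, 11, 4, 10, 8, 6, 7, 6, 9, 9, 8, 5, 10, 3],
    "Gogrial": [0] * 15,
}, dtype=float)

def to_levels(col):
    # to_levels is a hypothetical helper name, not part of pandas.
    # qcut raises on duplicate bin edges, so a constant column is labelled directly.
    if col.nunique() == 1:
        return pd.Series("low", index=col.index)
    binned = pd.qcut(col, q=[0, 0.25, 0.75, 1.0], labels=["low", "medium", "high"])
    # Assumption: plain strings are enough, so the categorical dtype is dropped.
    return binned.astype(str)

levels = df.apply(to_levels)
print(levels.head(3))
```

The first three rows should agree with the corresponding columns of the output shown above.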
One idea is to use pd.DataFrame.quantile with pd.cut:
cats = ['Low', 'Medium', 'High']
quantiles = df.iloc[:, 2:].quantile([0, 0.25, 0.75, 1.0])

for col in df.iloc[:, 2:]:
    bin_edges = quantiles[col]
    # special case situations where all values are equal
    if bin_edges.nunique() == 1:
        df[col] = 'Low'
    else:
        df[col] = pd.cut(df[col], bins=bin_edges, labels=cats, include_lowest=True)
Result:
print(df)
    Year  Month  CentralEquatoria  EasternEquatoria  Gogrial  Jonglei
0   2014  10     Medium            Low               Low      Low
1   2014  11     Low               Medium            Low      High
2   2014  12     Low               Medium            Low      High
3   2015  1      Medium            Medium            Low      Low
4   2015  2      Medium            Medium            Low      High
5   2015  3      Medium            Medium            Low      Medium
6   2015  4      Low               Low               Low      Medium
7   2015  5      Medium            Low               Low      Medium
8   2015  6      Low               Low               Low      Medium
9   2015  7      High              Medium            Low      Medium
10  2015  8      High              High              Low      Medium
11  2015  9      High              Low               Low      Medium
12  2015  10     High              Low               Low      Low
13  2015  11     Medium            Medium            Low      High
14  2015  12     Medium            High              Low      Low
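One thing to be aware of: pd.cut and pd.qcut use right-closed bins by default, so a value exactly equal to the 75% quantile lands in the middle category, whereas the function in the question treats it as "High" (r >= cut[.75]). If that boundary rule matters, a rough sketch with numpy.select (swapped in here instead of pd.cut) could look like the following. The name categorize_by_quantile is hypothetical, and labelling constant columns "Low" is an assumption carried over from the answers rather than something the question specifies.

```python
import numpy as np
import pandas as pd

def categorize_by_quantile(col, low_q=0.25, high_q=0.75):
    """Label one numeric column using the >= comparisons from the question."""
    # Assumption: a constant column is labelled "Low", as in the answers.
    if col.nunique() == 1:
        return pd.Series("Low", index=col.index)
    lo, hi = col.quantile([low_q, high_q])
    values = col.to_numpy()
    # np.select takes the first condition that is True for each element.
    labels = np.select([values >= hi, values >= lo], ["High", "Average"], default="Low")
    return pd.Series(labels, index=col.index)

# Two columns of the sample data from the question.
df = pd.DataFrame({
    "Eastern Equatoria": [1, 3, 5, 2, 5, 5, 1, 0, 1, 2, 7, 0, 0, 5, 7],
    "Gogrial": [0] * 15,
}, dtype=float)

result = df.apply(categorize_by_quantile)
print(result.head(3))
```

For Eastern Equatoria the 25% and 75% quantiles are 1.0 and 5.0, so a value of 5.0 is labelled "High" here, while the binned outputs above put it in the middle category.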