My DataFrame has one column:
import pandas as pd
values = [1, 1, 4, 5, 6, 6, 30, 20, 80, 90]  # named "values" to avoid shadowing the built-in list
df = pd.DataFrame({'col1': values})
How can I add one more column 'col2' that would contain categorical information in reference to col1:
if col1 > 0 and col1 <= 10 then col2 = 'xxx'
if col1 > 10 and col1 <= 50 then col2 = 'yyy'
if col1 > 50 then col2 = 'zzz'
At first thought, converting numeric data to categorical data seems like an easy problem. One simple approach would be to divide the raw source data into equal intervals. For example, for the data in the demo and Figure 2, the range is 78.0 - 60.0 = 18.0.
Categorical variables contain a finite number of categories or distinct groups. Categorical data might not have a logical order. For example, categorical predictors include gender, material type, and payment method. Discrete variables are numeric variables that have a countable number of values between any two values.
To create a categorical variable from an existing column, we use multiple if-else statements within the factor() function, assigning a value to the column when a given condition is true. If none of the conditions is true, the else value of the last statement is used.
You can use the cut() function in R to create a categorical variable from a continuous one. Note that breaks specifies the values to split the continuous variable on and labels specifies the label to give to the values of the new categorical variable.
You could use pd.cut as follows:
df['col2'] = pd.cut(df['col1'], bins=[0, 10, 50, float('Inf')], labels=['xxx', 'yyy', 'zzz'])
Output:
   col1 col2
0     1  xxx
1     1  xxx
2     4  xxx
3     5  xxx
4     6  xxx
5     6  xxx
6    30  yyy
7    20  yyy
8    80  zzz
9    90  zzz
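To make the bin edges concrete, here is a minimal sketch that runs the same pd.cut call on a few boundary values. The sample values (0, 10, 11, 50, 51) are chosen for illustration and are not part of the question's data. By default pd.cut uses right-closed intervals, so the bins are (0, 10], (10, 50] and (50, inf], which matches the conditions in the question.

```python
import pandas as pd

# Hypothetical boundary values, chosen to probe the bin edges.
df = pd.DataFrame({'col1': [0, 10, 11, 50, 51]})

# Default right=True gives the intervals (0, 10], (10, 50], (50, inf].
df['col2'] = pd.cut(
    df['col1'],
    bins=[0, 10, 50, float('inf')],
    labels=['xxx', 'yyy', 'zzz'],
)

labels = df['col2'].tolist()       # a missing label shows up as NaN
dtype_name = df['col2'].dtype.name
print(df)
```

Note that 0 falls outside the first bin and gets a missing value rather than a label, and that the new column has the category dtype rather than plain strings.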
You could first create a new column col2, and update its values based on the conditions:
df['col2'] = 'zzz'
df.loc[(df['col1'] > 0) & (df['col1'] <= 10), 'col2'] = 'xxx'
df.loc[(df['col1'] > 10) & (df['col1'] <= 50), 'col2'] = 'yyy'
print(df)
Output:
   col1 col2
0     1  xxx
1     1  xxx
2     4  xxx
3     5  xxx
4     6  xxx
5     6  xxx
6    30  yyy
7    20  yyy
8    80  zzz
9    90  zzz
Alternatively, you can also apply a function based on the column col1:
def func(x):
    if 0 < x <= 10:
        return 'xxx'
    elif 10 < x <= 50:
        return 'yyy'
    return 'zzz'
df['col2'] = df['col1'].apply(func)
and this will result in the same output.
For a DataFrame this small, the apply approach should be preferred as it is much faster:
%timeit run() # packaged to run the first approach
# 100 loops, best of 3: 3.28 ms per loop
%timeit df['col2'] = df['col1'].apply(func)
# 10000 loops, best of 3: 187 µs per loop
However, when the size of the DataFrame is large, the built-in vectorized operations (i.e. with the masking approach) might be faster.
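As a rough sketch of a vectorized variant for larger frames, the same conditions can be passed to numpy.select. This is a different technique from the .loc assignments above, though it is built from the same boolean masks. The 100,000-row random frame and the name big are assumptions made for illustration. The block only checks that the result agrees with apply; it makes no timing claim, so run %timeit on your own data before choosing.

```python
import numpy as np
import pandas as pd

def func(x):
    # Same row-wise function as in the apply approach.
    if 0 < x <= 10:
        return 'xxx'
    elif 10 < x <= 50:
        return 'yyy'
    return 'zzz'

# Hypothetical larger frame: 100,000 random integers from 1 to 100.
rng = np.random.default_rng(0)
big = pd.DataFrame({'col1': rng.integers(1, 101, size=100_000)})

# Boolean masks mirroring the conditions in the question.
conditions = [
    (big['col1'] > 0) & (big['col1'] <= 10),
    (big['col1'] > 10) & (big['col1'] <= 50),
]

# Rows matching neither mask fall through to the default label.
big['col2'] = np.select(conditions, ['xxx', 'yyy'], default='zzz')

# Sanity check against the row-wise version.
via_apply = big['col1'].apply(func)
same = bool((big['col2'] == via_apply).all())
print(same)
```

As with the masking and apply approaches, any value that matches neither condition (including values of 0 or below) ends up labelled 'zzz'.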