
How to create categorical variable based on a numerical variable

My DataFrame has one column:

import pandas as pd
values = [1, 1, 4, 5, 6, 6, 30, 20, 80, 90]  # named "values" to avoid shadowing the built-in list
df = pd.DataFrame({'col1': values})

How can I add a second column, 'col2', that holds a category derived from col1, like this:

if col1 > 0 and col1 <= 10 then col2 = 'xxx'
if col1 > 10 and col1 <= 50 then col2 = 'yyy'
if col1 > 50 then col2 = 'zzz'

Klausos Klausos asked Sep 17 '15 15:09



2 Answers

You could use pd.cut as follows:

df['col2'] = pd.cut(df['col1'], bins=[0, 10, 50, float('Inf')], labels=['xxx', 'yyy', 'zzz'])

Output:

   col1 col2
0     1  xxx
1     1  xxx
2     4  xxx
3     5  xxx
4     6  xxx
5     6  xxx
6    30  yyy
7    20  yyy
8    80  zzz
9    90  zzz
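
A note on the bin edges: by default pd.cut builds intervals that are open on the left and closed on the right, which lines up with the > / <= rules in the question. Here is a minimal sketch of where each boundary lands. The edge values (0, 10, 11, 50, 51) are made up for illustration and are not in the original data.

```python
import pandas as pd

# Hypothetical edge values chosen to probe each bin boundary.
s = pd.Series([0, 1, 10, 11, 50, 51, 90])

# Default right=True gives the intervals (0, 10], (10, 50], (50, inf].
labels = pd.cut(s, bins=[0, 10, 50, float('inf')], labels=['xxx', 'yyy', 'zzz'])

# 0 sits on the open left edge of the first bin, so it gets NaN, not 'xxx'.
# 10 and 50 fall in the lower bin because each right edge is included.
print(pd.DataFrame({'col1': s, 'col2': labels}))
```

So if col1 can contain zero or negative numbers, expect missing values in col2 unless you add a bin for them. Also note that the resulting column has the category dtype rather than plain strings.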

DontDivideByZero answered Sep 24 '22 14:09


You could first create a new column col2, and update its values based on the conditions:

df['col2'] = 'zzz'
df.loc[(df['col1'] > 0) & (df['col1'] <= 10), 'col2'] = 'xxx'
df.loc[(df['col1'] > 10) & (df['col1'] <= 50), 'col2'] = 'yyy'
print(df)

Output:

   col1 col2
0     1  xxx
1     1  xxx
2     4  xxx
3     5  xxx
4     6  xxx
5     6  xxx
6    30  yyy
7    20  yyy
8    80  zzz
9    90  zzz

Alternatively, you can also apply a function based on the column col1:

def func(x):
    if 0 < x <= 10:
        return 'xxx'
    elif 10 < x <= 50:
        return 'yyy'
    return 'zzz'

df['col2'] = df['col1'].apply(func)

and this will result in the same output.

For a DataFrame this small (10 rows), the apply approach is much faster:

%timeit run() # packaged to run the first approach
# 100 loops, best of 3: 3.28 ms per loop
%timeit df['col2'] = df['col1'].apply(func)
# 10000 loops, best of 3: 187 µs per loop

However, on a large DataFrame the built-in vectorized operations (i.e. the masking approach) might be faster.
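
If you want to check this on your own machine, something along these lines should do. It is only a rough sketch: the 100,000-row random frame, the seed, and the helper names with_masks and with_apply are my own choices for illustration, and the timings will depend on your hardware and pandas version.

```python
import timeit

import numpy as np
import pandas as pd


def func(x):
    # Same rules as in the apply approach above.
    if 0 < x <= 10:
        return 'xxx'
    elif 10 < x <= 50:
        return 'yyy'
    return 'zzz'


def with_masks(df):
    # Masking approach: start from the default label, then overwrite by condition.
    out = pd.Series('zzz', index=df.index)
    out.loc[(df['col1'] > 0) & (df['col1'] <= 10)] = 'xxx'
    out.loc[(df['col1'] > 10) & (df['col1'] <= 50)] = 'yyy'
    return out


def with_apply(df):
    # Apply approach: call func once per element.
    return df['col1'].apply(func)


# Hypothetical larger test frame: 100,000 random integers from 1 to 100.
rng = np.random.default_rng(0)
big = pd.DataFrame({'col1': rng.integers(1, 101, size=100_000)})

# Both approaches should assign identical labels.
assert with_masks(big).tolist() == with_apply(big).tolist()

for name, fn in [('masks', with_masks), ('apply', with_apply)]:
    seconds = timeit.timeit(lambda: fn(big), number=3)
    print(f'{name}: {seconds / 3:.4f} s per run')
```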

YS-L answered Sep 20 '22 14:09