
How to convert a continuous variable to a categorical variable?

Please point me in the right direction with this one. How can I convert a column that contains a continuous variable into a discrete variable? I have prices of financial instruments that I'm trying to convert into some kind of categorical value. I thought I could do the following.

labels = df['PRICE'].astype('category').cat.categories.tolist()
replace_map_comp = {'PRICE' : {k: v for k,v in zip(labels,list(range(1,len(labels)+1)))}}
print(replace_map_comp)

However, when I try to run a RandomForestClassifier over a subset of data, I'm getting an error.

from sklearn.ensemble import RandomForestClassifier
features = np.array(['INTEREST', 'SPREAD', 'BID', 'ASK', 'DAYS'])
clf = RandomForestClassifier()
clf.fit(df[features], df1['PRICE'])

Error message reads: ValueError: Unknown label type: 'continuous'

I'm pretty sure this is close, but something is definitely off here.

CODE UPDATE BELOW:

# copy only numerics to new DF
df1 = df.select_dtypes(include=[np.number])

from sklearn import linear_model
features = np.array(['INTEREST', 'SPREAD', 'BID', 'ASK', 'DAYS'])
reg = linear_model.LinearRegression()
reg.fit(df1[features], df1['PRICE'])

# problems start here...
importances = clf.feature_importances_
sorted_idx = np.argsort(importances)

padding = np.arange(len(features)) + 0.5
plt.barh(padding, importances[sorted_idx], align='center')
plt.yticks(padding, features[sorted_idx])
plt.xlabel("Relative Importance")
plt.title("Variable Importance")
plt.show()

Error: AttributeError: 'LinearRegression' object has no attribute 'feature_importances_'

Following concept from here:

http://blog.yhat.com/tutorials/5-Feature-Engineering.html

FYI, I tried one-hot encoding, but the transformation expanded the data into far too many columns and threw an error. Maybe the way to handle this is to take a smaller subset of the data. With 250k rows, I'm guessing 100k rows should be fairly representative of the entire data set. Just thinking out loud here.
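If sampling turns out to be the way to go, a minimal sketch with pandas looks like this (the toy frame below stands in for the real 250k-row data; the column name and sizes are placeholders):

```python
import numpy as np
import pandas as pd

# toy frame standing in for the real 250k-row data set
df = pd.DataFrame({'PRICE': np.random.default_rng(0).normal(500, 50, 1000)})

# draw a reproducible random subset; with the real data this would be n=100_000
sample = df.sample(n=400, random_state=42)
print(sample.shape)  # (400, 1)
```

`random_state` makes the subset reproducible, so the same rows come back on every run.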

asked Jul 15 '19 by ASH


3 Answers

Pandas has a cut function that could work for what you're trying to do:

import pandas as pd
import numpy as np
from scipy.stats import norm
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
n_bins = 5
df = pd.DataFrame(data=norm.rvs(loc=500, scale=50, size=100),
                  columns=['PRICE'])
y = label_encoder.fit_transform(pd.cut(df['PRICE'], n_bins, retbins=True)[0])
rfc = RandomForestClassifier(n_estimators=100, verbose=2)
rfc.fit(df[['PRICE']], y)

Here's a worked example. Keep in mind there are a hundred different ways one could do this, so this isn't necessarily the "correct" way; it's just one way.

Main idea: use Pandas cut function to create buckets for the continuous data. The number of buckets is up to you to decide. I chose n_bins as 5 in this example.

After you have the bins, they can be converted into classes with sklearn's LabelEncoder(). That makes it easy to refer back to the classes later: the encoder acts as a storage system for them so you can keep track of the mapping. Use label_encoder.classes_ to see the classes.

When you're done with these steps, y will look like this:

array([1, 2, 2, 0, 2, 2, 0, 1, 3, 1, 1, 2, 1, 4, 4, 2, 3, 1, 1, 3, 2, 3,
       2, 2, 2, 0, 2, 2, 4, 1, 3, 2, 1, 3, 3, 2, 1, 4, 3, 1, 1, 4, 2, 3,
       3, 2, 1, 1, 3, 4, 3, 3, 3, 2, 1, 2, 3, 1, 3, 1, 2, 0, 1, 1, 2, 4,
       1, 2, 2, 2, 0, 1, 0, 3, 3, 4, 2, 3, 3, 2, 3, 1, 3, 4, 2, 2, 2, 0,
       0, 0, 2, 2, 0, 4, 2, 3, 2, 2, 2, 2])

You have now converted continuous data into classes and can now pass to RandomForestClassifier().
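To sanity-check the mapping on a small toy series (the price values here are made up purely for illustration):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# made-up prices standing in for the real PRICE column
prices = pd.Series([100.0, 250.0, 410.0, 555.0, 700.0, 120.0, 680.0], name='PRICE')

bins = pd.cut(prices, 3)       # three equal-width buckets over the price range
le = LabelEncoder()
y = le.fit_transform(bins)     # one integer class per row

print(list(y))        # [0, 0, 1, 2, 2, 0, 2]
print(le.classes_)    # the Interval behind each class label
```

Each integer in y corresponds to one Interval in le.classes_, so le.inverse_transform() can map predictions back to price ranges.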

answered Oct 25 '22 by Jarad


Classifiers are appropriate when the explained variable consists of classes, and prices are not classes unless you define some explicit classes yourself:

df['CLASS'] = np.where(df.PRICE > 1000, 1, 0)  # 1 if price is above 1000, else 0

Regression methods are preferable when working with a continuous explained variable:

from sklearn import linear_model
reg = linear_model.LinearRegression()
reg.fit(df[features], df['PRICE'])
answered Oct 25 '22 by DeepBlue


One-hot encoding is one way to do it.

https://www.ritchieng.com/machinelearning-one-hot-encoding/

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

The original answer illustrated this with an image (source: https://towardsdatascience.com/natural-language-processing-count-vectorization-with-scikit-learn-e7804269bb5e).
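As a minimal sketch of what one-hot encoding produces, using pandas get_dummies on a made-up categorical column (the column name and values are hypothetical):

```python
import pandas as pd

# hypothetical ratings column; each distinct value becomes its own 0/1 column
df = pd.DataFrame({'RATING': ['AA', 'B', 'AA', 'CCC']})
one_hot = pd.get_dummies(df['RATING'], prefix='RATING')

print(one_hot.columns.tolist())   # ['RATING_AA', 'RATING_B', 'RATING_CCC']
```

Note that you get one new column per distinct value, which is exactly why one-hot encoding a high-cardinality column like raw prices blows up the column count, as the question describes.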

answered Oct 25 '22 by ode2k