Please point me in the right direction with this one. How can I convert a column that contains a continuous variable into a discrete variable? I have prices of financial instruments that I'm trying to convert into some kind of categorical value. I thought I could do the following.
# unique PRICE values become the category labels
labels = df['PRICE'].astype('category').cat.categories.tolist()
# map each label to an integer code starting at 1
replace_map_comp = {'PRICE': {k: v for k, v in zip(labels, range(1, len(labels) + 1))}}
print(replace_map_comp)
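I then apply the mapping with pandas' replace (assuming that's the intended usage):
# apply the nested {column: {old: new}} mapping in place
df.replace(replace_map_comp, inplace=True)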
However, when I try to run a RandomForestClassifier over a subset of data, I'm getting an error.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

features = np.array(['INTEREST',
'SPREAD',
'BID',
'ASK',
'DAYS'])
clf = RandomForestClassifier()
clf.fit(df[features], df['PRICE'])
Error message reads: ValueError: Unknown label type: 'continuous'
I'm pretty sure this is close, but something is definitely off here.
CODE UPDATE BELOW:
import numpy as np

# copy only numeric columns to a new DataFrame
df1 = df.select_dtypes(include=[np.number])
from sklearn import linear_model
features = np.array(['INTEREST',
'SPREAD',
'BID',
'ASK',
'DAYS'])
reg = linear_model.LinearRegression()
reg.fit(df1[features], df1['PRICE'])
import matplotlib.pyplot as plt

# problems start here...
importances = reg.feature_importances_
sorted_idx = np.argsort(importances)
padding = np.arange(len(features)) + 0.5
plt.barh(padding, importances[sorted_idx], align='center')
plt.yticks(padding, features[sorted_idx])
plt.xlabel("Relative Importance")
plt.title("Variable Importance")
plt.show()
Error: AttributeError: 'LinearRegression' object has no attribute 'feature_importances_'
Following concept from here:
http://blog.yhat.com/tutorials/5-Feature-Engineering.html
FYI, I tried one-hot encoding, but the transformation grew the number of columns too large and I got an error. Maybe the way to handle this is to take a smaller subset of the data: with 250k rows, I'm guessing 100k rows should be fairly representative of the entire data set. Just thinking out loud here.
In R you would use the cut() function to create a categorical variable from a continuous one: breaks specifies the values at which to split the continuous variable, and labels gives the names for the resulting categories. A pandas sketch of the same idea follows.
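As a minimal sketch of the pandas analogue (the prices, bin edges, and labels below are made up for illustration):
import pandas as pd

# hypothetical prices with hand-picked bin edges (breaks) and labels
df = pd.DataFrame({'PRICE': [120, 480, 510, 940]})
df['PRICE_BAND'] = pd.cut(df['PRICE'],
                          bins=[0, 250, 500, 750, 1000],
                          labels=['low', 'mid-low', 'mid-high', 'high'])
print(df)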
Pandas has a cut function that could work for what you're trying to do:
import pandas as pd
import numpy as np
from scipy.stats import norm
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
n_bins = 5
# synthetic prices: 100 draws from a normal distribution centred on 500
df = pd.DataFrame(data=norm.rvs(loc=500, scale=50, size=100),
                  columns=['PRICE'])
# bin PRICE into n_bins equal-width intervals, then encode each interval as an integer class
y = label_encoder.fit_transform(pd.cut(df['PRICE'], n_bins))
rfc = RandomForestClassifier(n_estimators=100, verbose=2)
rfc.fit(df[['PRICE']], y)
Here's a sample example. First, know that there are a hundred different ways one could do this, so this isn't necessarily the "correct" way; it's just one way.

Main idea: use pandas' cut function to create buckets for the continuous data. The number of buckets is up to you; I chose n_bins as 5 in this example.

Once you have the bins, they can be converted into classes with sklearn's LabelEncoder(). That way you can refer back to the classes more easily; it acts like a storage system for your classes so you can keep track of them. Use label_encoder.classes_ to see them.

When you're done with these steps, y will look like this:
array([1, 2, 2, 0, 2, 2, 0, 1, 3, 1, 1, 2, 1, 4, 4, 2, 3, 1, 1, 3, 2, 3,
2, 2, 2, 0, 2, 2, 4, 1, 3, 2, 1, 3, 3, 2, 1, 4, 3, 1, 1, 4, 2, 3,
3, 2, 1, 1, 3, 4, 3, 3, 3, 2, 1, 2, 3, 1, 3, 1, 2, 0, 1, 1, 2, 4,
1, 2, 2, 2, 0, 1, 0, 3, 3, 4, 2, 3, 3, 2, 3, 1, 3, 4, 2, 2, 2, 0,
0, 0, 2, 2, 0, 4, 2, 3, 2, 2, 2, 2])
You have now converted the continuous data into classes and can pass it to RandomForestClassifier().
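To see which interval each integer class stands for, and to map predictions back to intervals, something like this should work (reusing the names from the snippet above):
# the bin intervals behind each integer class
print(label_encoder.classes_)
# decode the classifier's predictions back into intervals
print(label_encoder.inverse_transform(rfc.predict(df[['PRICE']])))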
Classifiers are a good fit when the explained (target) variable consists of classes; prices are not classes unless you define explicit ones:
df['CLASS'] = np.where(df.PRICE > 1000, 1, 0)  # 1 if price is above 1000, else 0
Regression methods are preferable when the explained variable is continuous:
from sklearn import linear_model
reg = linear_model.LinearRegression()
reg.fit(df[features], df['PRICE'])
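If you need more than two price classes, a minimal sketch with np.select (the thresholds here are made up):
import numpy as np
import pandas as pd

df = pd.DataFrame({'PRICE': [300, 700, 1200]})
# hypothetical thresholds: <500 -> class 0, 500-999 -> class 1, >=1000 -> class 2
df['CLASS'] = np.select([df.PRICE < 500, df.PRICE < 1000], [0, 1], default=2)
print(df)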
One-hot encoding is one way to do it.
https://www.ritchieng.com/machinelearning-one-hot-encoding/
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
For a visual example of what one-hot encoding produces, see: https://towardsdatascience.com/natural-language-processing-count-vectorization-with-scikit-learn-e7804269bb5e
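A minimal sketch using scikit-learn's OneHotEncoder (the column name and values are made up; note the output is sparse by default, which helps with the column blow-up mentioned in the question):
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'PRICE_BAND': ['low', 'high', 'mid', 'low']})
encoder = OneHotEncoder()  # returns a sparse matrix by default
encoded = encoder.fit_transform(df[['PRICE_BAND']])
print(encoder.get_feature_names_out())  # one output column per category
print(encoded.toarray())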