
How to convert a continuous variable to a categorical variable?

Please point me in the right direction with this one. How can I convert a column that contains a continuous variable into a discrete variable? I have prices of financial instruments that I'm trying to convert into some kind of categorical value. I thought I could do the following.

labels = df['PRICE'].astype('category').cat.categories.tolist()
replace_map_comp = {'PRICE' : {k: v for k,v in zip(labels,list(range(1,len(labels)+1)))}}
print(replace_map_comp)

However, when I try to run a RandomForestClassifier over a subset of data, I'm getting an error.

from sklearn.ensemble import RandomForestClassifier
features = np.array(['INTEREST', 'SPREAD', 'BID', 'ASK', 'DAYS'])
clf = RandomForestClassifier()
clf.fit(df[features], df1['PRICE'])

Error message reads: ValueError: Unknown label type: 'continuous'

I'm pretty sure this is close, but something is definitely off here.

CODE UPDATE BELOW:

# copy only numerics to new DF
df1 = df.select_dtypes(include=[np.number])

from sklearn import linear_model
features = np.array(['INTEREST', 'SPREAD', 'BID', 'ASK', 'DAYS'])
reg = linear_model.LinearRegression()
reg.fit(df1[features], df1['PRICE'])

# problems start here...
importances = clf.feature_importances_
sorted_idx = np.argsort(importances)

padding = np.arange(len(features)) + 0.5
plt.barh(padding, importances[sorted_idx], align='center')
plt.yticks(padding, features[sorted_idx])
plt.xlabel("Relative Importance")
plt.title("Variable Importance")
plt.show()

Error: AttributeError: 'LinearRegression' object has no attribute 'feature_importances_'

Following concept from here:

http://blog.yhat.com/tutorials/5-Feature-Engineering.html

FYI, I tried one-hot encoding, but the transformation expanded the data into far too many columns and threw an error. Maybe the way to handle this is to take a smaller subset of the data. With 250k rows, I'm guessing 100k rows should be fairly representative of the entire data set. Just thinking out loud here.
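If sampling turns out to be the way to go, a minimal sketch with pandas looks like this (the toy frame below stands in for the real 250k-row data; the column name and sizes are placeholders):

```python
import numpy as np
import pandas as pd

# toy frame standing in for the real 250k-row data set
df = pd.DataFrame({'PRICE': np.random.default_rng(0).normal(500, 50, 1000)})

# draw a reproducible random subset; with the real data this would be n=100_000
sample = df.sample(n=400, random_state=42)
print(sample.shape)  # (400, 1)
```

`random_state` makes the subset reproducible, so the same rows come back on every run.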

asked Jul 15 '19 by ASH


3 Answers

Pandas has a cut function that could work for what you're trying to do:

import pandas as pd
import numpy as np
from scipy.stats import norm
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
n_bins = 5
df = pd.DataFrame(data=norm.rvs(loc=500, scale=50, size=100),
                  columns=['PRICE'])
y = label_encoder.fit_transform(pd.cut(df['PRICE'], n_bins, retbins=True)[0])
rfc = RandomForestClassifier(n_estimators=100, verbose=2)
rfc.fit(df[['PRICE']], y)

Here's a worked example. Keep in mind there are a hundred different ways one could do this, so this isn't necessarily the "correct" way; it's just one way.

Main idea: use Pandas cut function to create buckets for the continuous data. The number of buckets is up to you to decide. I chose n_bins as 5 in this example.

After you have the bins, they can be converted into classes with sklearn's LabelEncoder(). That makes it easy to refer back to the classes later: the encoder acts as a storage system for them so you can keep track of the mapping. Use label_encoder.classes_ to see the classes.

When you're done with these steps, y will look like this:

array([1, 2, 2, 0, 2, 2, 0, 1, 3, 1, 1, 2, 1, 4, 4, 2, 3, 1, 1, 3, 2, 3,
       2, 2, 2, 0, 2, 2, 4, 1, 3, 2, 1, 3, 3, 2, 1, 4, 3, 1, 1, 4, 2, 3,
       3, 2, 1, 1, 3, 4, 3, 3, 3, 2, 1, 2, 3, 1, 3, 1, 2, 0, 1, 1, 2, 4,
       1, 2, 2, 2, 0, 1, 0, 3, 3, 4, 2, 3, 3, 2, 3, 1, 3, 4, 2, 2, 2, 0,
       0, 0, 2, 2, 0, 4, 2, 3, 2, 2, 2, 2])

You have now converted continuous data into classes and can now pass to RandomForestClassifier().
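To sanity-check the mapping on a small toy series (the price values here are made up purely for illustration):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# made-up prices standing in for the real PRICE column
prices = pd.Series([100.0, 250.0, 410.0, 555.0, 700.0, 120.0, 680.0], name='PRICE')

bins = pd.cut(prices, 3)       # three equal-width buckets over the price range
le = LabelEncoder()
y = le.fit_transform(bins)     # one integer class per row

print(list(y))        # [0, 0, 1, 2, 2, 0, 2]
print(le.classes_)    # the Interval behind each class label
```

Each integer in y corresponds to one Interval in le.classes_, so le.inverse_transform() can map predictions back to price ranges.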

answered Oct 25 '22 by Jarad


Classifiers are appropriate when the explained variable consists of classes, and prices are not classes unless you define some explicit classes yourself:

df['CLASS'] = np.where(df.PRICE > 1000, 1, 0)  # 1 if price is above 1000, else 0

Regression methods are preferable when working with a continuous explained variable:

from sklearn import linear_model
reg = linear_model.LinearRegression()
reg.fit(df[features], df['PRICE'])
answered Oct 25 '22 by DeepBlue


One-hot encoding is one way to do it.

https://www.ritchieng.com/machinelearning-one-hot-encoding/

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

The original answer illustrated this with an image (source: https://towardsdatascience.com/natural-language-processing-count-vectorization-with-scikit-learn-e7804269bb5e).
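As a minimal sketch of what one-hot encoding produces, using pandas get_dummies on a made-up categorical column (the column name and values are hypothetical):

```python
import pandas as pd

# hypothetical ratings column; each distinct value becomes its own 0/1 column
df = pd.DataFrame({'RATING': ['AA', 'B', 'AA', 'CCC']})
one_hot = pd.get_dummies(df['RATING'], prefix='RATING')

print(one_hot.columns.tolist())   # ['RATING_AA', 'RATING_B', 'RATING_CCC']
```

Note that you get one new column per distinct value, which is exactly why one-hot encoding a high-cardinality column like raw prices blows up the column count, as the question describes.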

answered Oct 25 '22 by ode2k