How to do discretization of continuous attributes in sklearn?

Tags: scikit-learn, discretization

My data consists of a mix of continuous and categorical features. Below is a small snippet of what my data looks like in CSV format (consider it data collected by a superstore chain that operates stores in different cities):

city,avg_income_in_city,population,square_feet_of_store_area,  store_type ,avg_revenue
NY  ,54504            , 3506908   ,3006                       ,INDOOR    , 8000091
CH  ,44504            , 2505901   ,4098                       ,INDOOR    , 4000091
HS  ,50134            , 3206911   ,1800                       ,KIOSK     , 7004567
NY  ,54504            , 3506908   ,1000                       ,KIOSK     , 2000091

Here you can see that avg_income_in_city, square_feet_of_store_area and avg_revenue are continuous values, whereas city, store_type etc. are categorical classes (there are a few more which I have not shown here, for brevity).

I wish to model the data in order to predict the revenue. The question is how to discretize the continuous values using sklearn. Does sklearn provide any ready-made class/method for discretizing continuous values (like we have in Orange, e.g. Orange.Preprocessor_discretize(data, method=orange.EntropyDiscretization()))?

Thanks !

asked Apr 24 '14 11:04

data_learner

1 Answer

Update (Sep 2018): As of version 0.20.0, scikit-learn has a class, sklearn.preprocessing.KBinsDiscretizer, which discretizes continuous features using one of a few strategies:

Uniformly-sized bins (strategy='uniform')
Bins with "equal" numbers of samples inside, as far as possible (strategy='quantile')
Bins based on K-means clustering (strategy='kmeans')
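
For illustration, here is a minimal sketch of the three strategies applied to two of the columns from the question (avg_income_in_city and square_feet_of_store_area). The choices n_bins=2 and encode="ordinal" are assumptions made only to keep the output small; they are not something the question prescribes.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Two continuous columns taken from the sample rows in the question:
# avg_income_in_city, square_feet_of_store_area
X = np.array([
    [54504, 3006],
    [44504, 4098],
    [50134, 1800],
    [54504, 1000],
], dtype=float)

binned = {}
for strategy in ("uniform", "quantile", "kmeans"):
    # n_bins=2 and encode="ordinal" are illustrative choices (assumptions)
    disc = KBinsDiscretizer(n_bins=2, encode="ordinal", strategy=strategy)
    binned[strategy] = disc.fit_transform(X)
    print(strategy)
    print(binned[strategy])
    # bin_edges_ holds one array of learned edges per input column
    print([edges.tolist() for edges in disc.bin_edges_])
```

With encode="ordinal" each value is replaced by the index of its bin; the default encoding is one-hot, which may suit linear models better.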

Unfortunately, at the moment, the class does not accept custom intervals (which is a bummer for me, as that is what I wanted and the reason I ended up here). If you want custom intervals, you can use the pandas function cut:

import numpy as np
import pandas as pd
n_samples = 10
a = np.random.randint(0, 10, n_samples)

# say you want to split at 1 and 3
boundaries = [1, 3]
# add min and max values of your data
boundaries = sorted({a.min(), a.max() + 1} | set(boundaries))

a_discretized_1 = pd.cut(a, bins=boundaries, right=False)
a_discretized_2 = pd.cut(a, bins=boundaries, labels=range(len(boundaries) - 1), right=False)
a_discretized_3 = pd.cut(a, bins=boundaries, labels=range(len(boundaries) - 1), right=False).astype(float)
print(a, '\n')
print(a_discretized_1, '\n', a_discretized_1.dtype, '\n')
print(a_discretized_2, '\n', a_discretized_2.dtype, '\n')
print(a_discretized_3, '\n', a_discretized_3.dtype, '\n')

which produces:

[2 2 9 7 2 9 3 0 4 0]

[[1, 3), [1, 3), [3, 10), [3, 10), [1, 3), [3, 10), [3, 10), [0, 1), [3, 10), [0, 1)]
Categories (3, interval[int64]): [[0, 1) < [1, 3) < [3, 10)]
 category

[1, 1, 2, 2, 1, 2, 2, 0, 2, 0]
Categories (3, int64): [0 < 1 < 2]
 category

[1. 1. 2. 2. 1. 2. 2. 0. 2. 0.]
 float64

Note that, when given a NumPy array as here, pd.cut returns a pd.Categorical (a pd.Series input gives a pd.Series of dtype category) whose categories are of type interval[int64]. If you specify your own labels, the result is still categorical, but the categories will be of type int64. If you want a plain numeric dtype instead, convert the result with .astype(float), as in the third example, or with .astype(np.int64).

My example uses integer data, but it should work just as well with floats.
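
As a rough sketch of the float case, the same approach can be applied to a DataFrame column. The edges below (1500 and 3500 square feet) and the column name area_bin are hypothetical, picked only for illustration; labels=False asks pd.cut for integer bin indices directly instead of a categorical.

```python
import pandas as pd

# A few rows from the question, with the store area stored as floats
df = pd.DataFrame({
    "city": ["NY", "CH", "HS", "NY"],
    "square_feet_of_store_area": [3006.0, 4098.0, 1800.0, 1000.0],
})

# Hypothetical custom boundaries; the open-ended last bin catches large stores
edges = [0.0, 1500.0, 3500.0, float("inf")]

# right=False makes the bins left-closed: [0, 1500), [1500, 3500), [3500, inf)
df["area_bin"] = pd.cut(
    df["square_feet_of_store_area"], bins=edges, labels=False, right=False
)
print(df)
```

The resulting integer column can then be passed to an sklearn estimator or encoder like any other feature.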

answered Oct 26 '22 08:10

marcotama
