
How to use KBinsDiscretizer to make continuous data into bins in Sklearn?

I am working on an ML algorithm in which I am trying to convert continuous target values into a small number of bins, to understand the problem better and hence make better predictions. My original problem is a regression problem, but I am converting it into a classification problem by binning the target and using the bin labels.

I did the following:

from sklearn.preprocessing import KBinsDiscretizer  
est = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
s = est.fit(target) 
Xt = est.transform(s)

It raises the ValueError below. I then reshaped my data into a 2D array (second snippet), but I still could not solve it.

ValueError: Expected 2D array, got 1D array instead:

import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

myData = pd.read_csv("train.csv", delimiter=",")
# Continuous target column that must be binned and stored in a new column.
target = myData.iloc[:, -5]

xx = target.values.reshape(21263,1)

est = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
s = est.fit(xx) 
Xt = est.transform(s)

As you can see, my target has 21263 rows. I need to divide these into 10 equal bins and write the result into a new column in my dataframe. Thanks for the guidance.

P.S.: Max target value: 185.0
Min target value: 0.00021

Mass17 asked Dec 28 '18 19:12




1 Answer

Okay, I was able to solve it. I am posting the answer in case anyone else needs it in the future. I used pandas.qcut:

target['Temp_class'] = pd.qcut(target['Temeratue'], 10, labels=False)

This has solved my problem.
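For anyone who wants to stay with KBinsDiscretizer, here is a minimal sketch on synthetic data. The column name "target" and the random values are made up for illustration; they are not the asker's dataset. Two points matter: fit() returns the fitted estimator itself, so the data (not the result of fit) has to be passed to transform(), and pd.qcut produces equal-count bins, which corresponds to strategy='quantile' rather than strategy='uniform' (equal-width bins, closer to pd.cut).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

# Synthetic stand-in for the real data (hypothetical column name and values).
rng = np.random.default_rng(0)
df = pd.DataFrame({"target": rng.uniform(0.00021, 185.0, size=1000)})

# KBinsDiscretizer expects a 2D array of shape (n_samples, n_features).
# Selecting with a list of column names keeps the data two-dimensional.
X = df[["target"]].to_numpy()

# strategy="quantile" puts roughly the same number of rows in each bin,
# which is the behaviour pd.qcut gives. Use strategy="uniform" instead
# if the bins should all have the same width.
est = KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="quantile")

# fit_transform() fits and transforms the same data in one call.
# Writing est.transform(est.fit(X)) would fail, because fit() returns
# the estimator, not the data.
df["bin_sklearn"] = est.fit_transform(X).ravel().astype(int)

# The pandas approach from the answer, for comparison.
df["bin_qcut"] = pd.qcut(df["target"], 10, labels=False)

print(df["bin_sklearn"].value_counts().sort_index())
print(df["bin_qcut"].value_counts().sort_index())
```

The two columns may differ for values that sit exactly on a bin edge, since the two libraries do not necessarily treat edges the same way, so it is worth comparing the counts on your own data before relying on either one.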

Mass17 answered Sep 19 '22 01:09