I am working on a machine learning problem in which I tried to convert the continuous target values into small bins to understand the problem better and hence make better predictions. My original problem is a regression, but I convert it into a classification by making small, labelled bins.
I did the following:
from sklearn.preprocessing import KBinsDiscretizer
est = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
s = est.fit(target)
Xt = est.transform(s)
It raises a ValueError, shown below. I then reshaped my data into 2D, yet I still could not solve it:
ValueError: Expected 2D array, got 1D array instead:
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

myData = pd.read_csv("train.csv", delimiter=",")
target = myData.iloc[:, -5]  # continuous data that must be converted
                             # into bins and written to a new column
xx = target.values.reshape(21263,1)
est = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
s = est.fit(xx)
Xt = est.transform(s)
As you can see, my target has 21263 rows. I have to divide these into 10 equal bins and write the result into a new column in my dataframe. Thanks for the guidance.
P.S.:
Max target value: 185.0
Min target value: 0.00021
Discretization is the process through which we can transform continuous variables, models or functions into a discrete form. We do this by creating a set of contiguous intervals (or bins) that span the range of the variable/model/function in question. Continuous data is measured, while discrete data is counted.
We can use NumPy's digitize() function to discretize a quantitative variable. Consider a simple binning where we use 50 as the threshold: values below 50 fall into category 0, and values of 50 or above fall into category 1.
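For example, a minimal sketch of that thresholding (the column name and values here are made up for illustration):

import numpy as np
import pandas as pd

# made-up example values
df = pd.DataFrame({"Temperature": [12.3, 49.9, 50.0, 75.2, 180.1]})

# np.digitize returns the index of the bin each value falls into:
# with a single edge at 50, values below 50 get 0 and values of 50 or more get 1.
df["Temp_binary"] = np.digitize(df["Temperature"], bins=[50])
print(df)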
You can combine KBinsDiscretizer with ColumnTransformer if you only want to preprocess part of the features. KBinsDiscretizer might produce constant features (e.g., when encode='onehot' and certain bins do not contain any data); these can be removed with feature selection algorithms such as VarianceThreshold.
In the example below, we discretize a feature and one-hot encode the transformed data. Note that if the bins are not reasonably wide, there is a substantially increased risk of overfitting, so the discretizer parameters should usually be tuned under cross-validation.
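A hedged sketch of that pattern, with made-up column names and data: only the Temperature column is binned and one-hot encoded, the remaining columns pass through untouched, and VarianceThreshold drops any constant (empty-bin) columns.

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import VarianceThreshold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer

# illustrative data only
df = pd.DataFrame({
    "Temperature": [0.5, 20.0, 45.0, 90.0, 185.0],
    "Pressure":    [1.0, 1.2, 0.9, 1.1, 1.3],
})

preprocess = ColumnTransformer(
    transformers=[
        ("temp_bins",
         KBinsDiscretizer(n_bins=3, encode="onehot-dense", strategy="uniform"),
         ["Temperature"]),
    ],
    remainder="passthrough",  # leave the other columns unchanged
)

pipe = Pipeline([
    ("preprocess", preprocess),
    ("drop_constant", VarianceThreshold(threshold=0.0)),  # remove constant bin columns
])

Xt = pipe.fit_transform(df)
print(Xt.shape)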
With encode='ordinal', the transformer returns the bin identifier encoded as an integer value. The strategy parameter defines the widths of the bins:
- 'uniform': all bins in each feature have identical widths.
- 'quantile': all bins in each feature have the same number of points.
- 'kmeans': values in each bin have the same nearest center of a 1D k-means cluster.
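A small sketch comparing the three strategies on the same skewed, made-up values:

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# made-up, right-skewed data
X = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 1.0, 5.0, 50.0, 120.0, 185.0]).reshape(-1, 1)

for strategy in ("uniform", "quantile", "kmeans"):
    est = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy=strategy)
    labels = est.fit_transform(X).ravel()  # integer bin labels per sample
    print(strategy, labels)

With skewed data like this, 'uniform' puts almost every value into the first bin, while 'quantile' spreads the samples evenly across the bins.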
Okay, I was able to solve it. I am posting the answer in case anyone else needs it in the future. I used pandas.qcut:
myData['Temp_class'] = pd.qcut(target, 10, labels=False)  # 10 quantile bins, labelled 0-9
This has solved my problem.
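For completeness, the KBinsDiscretizer route from the question can also work once the data is passed as a 2D array and the transform is applied to that array rather than to the fitted estimator. A sketch, assuming the same train.csv layout as above (strategy='quantile' mirrors pd.qcut; 'uniform' would give equal-width bins instead):

import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

myData = pd.read_csv("train.csv", delimiter=",")
xx = myData.iloc[:, -5].values.reshape(-1, 1)  # 2D array of shape (n_samples, 1)

est = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='quantile')
myData['Temp_class'] = est.fit_transform(xx).ravel().astype(int)  # bin labels 0-9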