 

Is there no "inverse_transform" method for a scaler like MinMaxScaler in spark?

When training a model, say linear regression, we may apply a normalization such as MinMaxScaler to the train and test datasets.

After we have a trained model and use it to make predictions, we need to scale the predictions back to the original representation.

In Python, scikit-learn provides an "inverse_transform" method for this. For example:

from sklearn.preprocessing import MinMaxScaler

data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
print(data)

scaler = MinMaxScaler(feature_range=(0, 1))

dataScaled = scaler.fit(data).transform(data)
print(dataScaled)

# scale back to the original representation
print(scaler.inverse_transform(dataScaled))

Is there a similar method in Spark?

I have googled a lot, but found no answer. Can anyone give me some suggestions? Thank you very much!

asked Sep 07 '17 by Levine Huang


People also ask

How does inverse transformation work in minmaxscaler?

The inverse_transform method reverses the scaling: using the per-feature minimum and maximum learned during fit, it maps values from the scaled range back to the original feature range, so predictions made on scaled data can be expressed in the original units.

What is minmaxscaler in sklearn?

MinMaxScaler shrinks the data to a given range, usually 0 to 1. It scales each feature to that range without changing the shape of the original distribution. The scaling is computed as x_scaled = (x - x_min) / (x_max - x_min). Sklearn's preprocessing module provides the MinMaxScaler() class to achieve this.

What is minmax scaler in Python?

MinMaxScaler is a way of data scaling where the minimum of a feature is mapped to zero and the maximum to one (or, more generally, to the bounds of a chosen range). It transforms the data by rescaling each feature to that range.

How do I use minmaxscaler to transform a dataset?

First, a MinMaxScaler instance is defined with default hyperparameters. Once defined, we can call its fit_transform() function and pass it our dataset to create a transformed version of the dataset.
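
In Spark's ml package the forward step looks much the same: a MinMaxScaler is fitted on a Vector column and then applied with transform. A minimal sketch in Scala (assuming an existing SparkSession named spark; the column names are made up for illustration):

import org.apache.spark.ml.feature.MinMaxScaler
import org.apache.spark.ml.linalg.Vectors

val df = spark.createDataFrame(Seq(
  (0, Vectors.dense(-1.0, 2.0)),
  (1, Vectors.dense(-0.5, 6.0)),
  (2, Vectors.dense(0.0, 10.0)),
  (3, Vectors.dense(1.0, 18.0))
)).toDF("id", "features")

val scaler = new MinMaxScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures")

// fit learns the per-feature min/max; transform rescales each feature to [0, 1]
val scalerModel = scaler.fit(df)
val scaled = scalerModel.transform(df)
scaled.show(truncate = false)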


1 Answer

In our company, in order to solve the same problem on the StandardScaler, we extended spark.ml with this (among other things):

package org.apache.spark.ml

import org.apache.spark.ml.linalg.DenseVector
import org.apache.spark.ml.util.Identifiable

package object feature {

    implicit class RichStandardScalerModel(model: StandardScalerModel) {

        // StandardScaler computes y = (x - mu) / sigma, so the inverse x = y * sigma + mu
        // is itself a StandardScalerModel with std' = 1 / sigma and mean' = -mu / sigma
        private def invertedStdDev(sigma: Double): Double = 1 / sigma

        private def invertedMean(mu: Double, sigma: Double): Double = -mu / sigma

        // Build a StandardScalerModel that undoes this model's transform,
        // reading from this model's output column and writing to newOutputCol
        def inverse(newOutputCol: String): StandardScalerModel = {
            val sigma: linalg.Vector = model.std
            val mu: linalg.Vector = model.mean
            val newSigma: linalg.Vector = new DenseVector(sigma.toArray.map(invertedStdDev))
            val newMu: linalg.Vector = new DenseVector(mu.toArray.zip(sigma.toArray).map { case (m, s) => invertedMean(m, s) })
            val inverted: StandardScalerModel = new StandardScalerModel(Identifiable.randomUID("stdScal"), newSigma, newMu)
                .setInputCol(model.getOutputCol)
                .setOutputCol(newOutputCol)

            inverted
                .set(inverted.withMean, model.getWithMean)
                .set(inverted.withStd, model.getWithStd)
        }
    }

}
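
With this extension in scope, the inverted scaler is just another transformer. A minimal usage sketch, assuming a fitted StandardScalerModel named scalerModel and a DataFrame scaledDf that contains its output column (the column name features_orig is made up for illustration):

import org.apache.spark.ml.feature._  // brings the implicit RichStandardScalerModel into scope

val unscaler = scalerModel.inverse("features_orig")
val restoredDf = unscaler.transform(scaledDf)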

It should be fairly easy to modify it or do something similar for your specific case.
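
For the MinMaxScaler case in the question, one option that avoids constructing a new model is to undo the scaling with a UDF built from the fitted MinMaxScalerModel's parameters: Spark rescales each feature as (e - Emin) / (Emax - Emin) * (max - min) + min, so the inverse is (y - min) / (max - min) * (Emax - Emin) + Emin. A minimal sketch, not part of the original answer; the helper name inverseMinMax and the column names are made up:

import org.apache.spark.ml.feature.MinMaxScalerModel
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.udf

// Hypothetical helper: undoes a fitted MinMaxScalerModel on a Vector column.
def inverseMinMax(model: MinMaxScalerModel, df: DataFrame,
                  scaledCol: String, outputCol: String): DataFrame = {
  val eMin = model.originalMin.toArray   // per-feature minima learned by fit()
  val eMax = model.originalMax.toArray   // per-feature maxima learned by fit()
  val lo = model.getMin                  // target range lower bound (default 0.0)
  val hi = model.getMax                  // target range upper bound (default 1.0)

  val invert = udf { v: Vector =>
    Vectors.dense(v.toArray.zipWithIndex.map { case (y, i) =>
      val range = eMax(i) - eMin(i)
      if (range == 0.0) eMin(i)          // constant feature: map back to its original value
      else (y - lo) / (hi - lo) * range + eMin(i)
    })
  }

  df.withColumn(outputCol, invert(df(scaledCol)))
}

Calling inverseMinMax(scalerModel, scaledDf, "scaledFeatures", "features_orig") then restores the original scale, modulo the floating-point rounding mentioned below.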

Keep in mind that due to double-precision floating-point arithmetic on the JVM, you normally lose some precision in these operations, so you will not recover the exact original values you had before the transformation (e.g. you will probably get something like 1.9999999999999998 instead of 2.0).

answered Sep 25 '22 by Simone Grandi