 

Python library for data scaling, centering and Box-Cox transformation

Is there any package in Python that performs data transformations such as scaling, centering, and Box-Cox transformation to eliminate the skewness of data? In R this can be done with the caret package:

set.seed(1)
predictors = data.frame(x1 = rnorm(1000,
                                   mean = 5,
                                   sd = 2),
                        x2 = rexp(1000,
                                  rate=10))

require(caret)

trans = preProcess(predictors, 
                   c("BoxCox", "center", "scale"))
predictorsTrans = data.frame(
      trans = predict(trans, predictors))

I know about sklearn, but I was unable to find the above-mentioned processing functions.

asked Nov 26 '15 by Klausos Klausos

2 Answers

For scaling and centering you can use preprocessing from sklearn:

from sklearn import preprocessing
# scale() centers each column to zero mean and rescales it to unit variance
centered_scaled_data = preprocessing.scale(original_data)
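If you also need to reuse the same centering and scaling on new data (or drop it into a Pipeline), StandardScaler is the estimator counterpart of preprocessing.scale. A small sketch, assuming original_data is a 2-D NumPy array; the exponential sample below is just placeholder data:

import numpy as np
from sklearn.preprocessing import StandardScaler

original_data = np.random.exponential(scale=0.1, size=(1000, 2))   # placeholder data

scaler = StandardScaler()
centered_scaled_data = scaler.fit_transform(original_data)   # fit on training data...
# ...then reuse the same means/scales on new data: scaler.transform(new_data)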

For Box-Cox you can use boxcox from scipy:

from scipy.stats import boxcox
# when lmbda is omitted, boxcox returns the transformed data and the fitted lambda
boxcox_transformed_data, best_lambda = boxcox(original_data)

For calculation of skewness you can use skew from scipy:

from scipy.stats import skew
skewness = skew(original_data)
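Putting the three pieces together, here is a rough sketch of a caret-style preProcess(c("BoxCox", "center", "scale")) workflow on a pandas DataFrame. The data generation mirrors the R example above, and the boxcox_if_positive helper is my own addition, since Box-Cox only accepts strictly positive values:

import numpy as np
import pandas as pd
from scipy.stats import boxcox, skew
from sklearn import preprocessing

rng = np.random.default_rng(1)
predictors = pd.DataFrame({
    "x1": rng.normal(loc=5, scale=2, size=1000),      # ~ rnorm(1000, mean = 5, sd = 2)
    "x2": rng.exponential(scale=1 / 10, size=1000),   # ~ rexp(1000, rate = 10)
})

def boxcox_if_positive(col):
    # Box-Cox requires strictly positive data; leave other columns unchanged
    return boxcox(col)[0] if (col > 0).all() else col.to_numpy()

transformed = predictors.apply(boxcox_if_positive)
predictors_trans = pd.DataFrame(
    preprocessing.scale(transformed),   # center and scale each column
    columns=predictors.columns,
)

# skewness before and after, to check that the transformation helped
print(predictors.apply(skew))
print(predictors_trans.apply(skew))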

You can read more details about Resolving Skewness in this post. Also, you can find more details about Centering & Scaling here.

answered Nov 06 '22 by Shahram


Now scikit-learn has a method to do what you want. This provides a familiar API and is easy to put into pipelines.

sklearn version 0.20.0 has a Box-Cox transformation available through the power_transform method. This method applies Box-Cox and then applies zero-mean, unit-variance normalization to the data. You can disable the default normalization by passing standardize=False.

sklearn.preprocessing.power_transform(X, method='box-cox', standardize=True, copy=True)

Apply a power transform featurewise to make data more Gaussian-like.

Power transforms are a family of parametric, monotonic transformations that are applied to make data more Gaussian-like. This is useful for modeling issues related to heteroscedasticity (non-constant variance), or other situations where normality is desired.

Currently, power_transform() supports the Box-Cox transform. Box-Cox requires input data to be strictly positive. The optimal parameter for stabilizing variance and minimizing skewness is estimated through maximum likelihood.

By default, zero-mean, unit-variance normalization is applied to the transformed data.

http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.power_transform.html
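As a quick sanity check of the function form, a minimal sketch; the exponential sample is just an assumption, chosen because the Box-Cox method rejects non-positive inputs:

import numpy as np
from sklearn.preprocessing import power_transform

X = np.random.exponential(scale=0.1, size=(1000, 2))   # strictly positive placeholder data
X_trans = power_transform(X, method='box-cox')          # Box-Cox, then zero-mean/unit-variance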

The primary docs page doesn't mention it, but power_transform also supports Yeo-Johnson transformation.

The docs also have a nice explanation here: http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-transformer
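Inside a pipeline you would normally use the estimator counterpart, PowerTransformer. A minimal sketch; the LinearRegression step and the toy data are placeholders, and 'yeo-johnson' is chosen here because it also accepts zero and negative values:

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PowerTransformer
from sklearn.linear_model import LinearRegression

X = np.random.exponential(scale=0.1, size=(1000, 2))          # placeholder features
y = X @ np.array([1.0, 2.0]) + np.random.normal(size=1000)    # placeholder target

model = Pipeline([
    # use method="box-cox" instead if all features are strictly positive
    ("power", PowerTransformer(method="yeo-johnson", standardize=True)),
    ("reg", LinearRegression()),
])
model.fit(X, y)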

answered Nov 06 '22 by jeffhale