Is there a Python package that performs data transformations such as scaling, centering, and the Box-Cox transformation to eliminate skewness? In R this can be done with the caret package:
set.seed(1)
predictors = data.frame(x1 = rnorm(1000, mean = 5, sd = 2),
                        x2 = rexp(1000, rate = 10))
require(caret)
trans = preProcess(predictors, c("BoxCox", "center", "scale"))
predictorsTrans = data.frame(trans = predict(trans, predictors))
I know about sklearn, but I was unable to find the above-mentioned preprocessing functions there.
For scaling and centering you can use preprocessing from sklearn:
from sklearn import preprocessing
centered_scaled_data = preprocessing.scale(original_data)
For Box-Cox you can use boxcox from scipy:
from scipy.stats import boxcox
boxcox_transformed_data, fitted_lambda = boxcox(original_data)
Note that boxcox requires strictly positive input and returns both the transformed data and the fitted lambda.
For calculating skewness you can use skew from scipy:
from scipy.stats import skew
skewness = skew(original_data)
You can read more details about Resolving Skewness in this post. Also, you can find more details about Centering & Scaling here.
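Putting the pieces above together, here is a minimal sketch of the caret workflow from the question reproduced in Python. The data generation mirrors the R example (a normal column and a right-skewed exponential column); the variable names are illustrative, not from any library.

```python
import numpy as np
from scipy.stats import boxcox, skew
from sklearn import preprocessing

# Synthetic data mirroring the R example: rnorm(1000, 5, 2) and
# rexp(1000, rate = 10) (rate 10 corresponds to scale 1/10).
rng = np.random.default_rng(1)
x1 = rng.normal(loc=5, scale=2, size=1000)
x2 = rng.exponential(scale=0.1, size=1000)

print("skew of x2 before:", skew(x2))

# Box-Cox needs strictly positive input; with no lambda given it
# estimates the optimal lambda by maximum likelihood and returns it.
x2_bc, fitted_lambda = boxcox(x2)

# Center and scale each column to zero mean, unit variance.
transformed = preprocessing.scale(np.column_stack([x1, x2_bc]))

print("skew of x2 after:", skew(transformed[:, 1]))
```

The second column ends up approximately symmetric with zero mean and unit variance, which is what caret's `c("BoxCox", "center", "scale")` recipe produces.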
Now scikit-learn has a method that does what you want, with a familiar API that is easy to put into pipelines.
Since version 0.20.0, sklearn provides a Box-Cox transformation through the power_transform function. It applies Box-Cox and then zero-mean, unit-variance normalization to the data; pass standardize=False to skip the normalization step.
sklearn.preprocessing.power_transform(X, method='box-cox', standardize=True, copy=True)
Apply a power transform featurewise to make data more Gaussian-like.
Power transforms are a family of parametric, monotonic transformations that are applied to make data more Gaussian-like. This is useful for modeling issues related to heteroscedasticity (non-constant variance), or other situations where normality is desired.
Currently, power_transform() supports the Box-Cox transform. Box-Cox requires input data to be strictly positive. The optimal parameter for stabilizing variance and minimizing skewness is estimated through maximum likelihood.
By default, zero-mean, unit-variance normalization is applied to the transformed data.
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.power_transform.html
The primary docs page doesn't mention it, but power_transform also supports the Yeo-Johnson transformation.
The docs also have a nice explanation here: http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-transformer
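As a quick sketch of the API described above (the synthetic data is made up for illustration): power_transform handles the one-shot case, while the PowerTransformer estimator is the form you would drop into a Pipeline and lets you inspect the fitted lambdas.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer, power_transform

rng = np.random.default_rng(0)
X = rng.exponential(scale=1.0, size=(1000, 1))  # strictly positive, skewed

# One-shot function: Box-Cox followed by zero-mean, unit-variance
# standardization (standardize=True is the default).
X_bc = power_transform(X, method='box-cox')

# Yeo-Johnson also accepts zero and negative values.
X_yj = power_transform(X - 0.5, method='yeo-johnson')

# Estimator form, pipeline-friendly; exposes the fitted lambdas.
pt = PowerTransformer(method='box-cox')
X_t = pt.fit_transform(X)
print(pt.lambdas_)
```

Using the method='box-cox' string explicitly avoids relying on the default, which has changed across sklearn versions.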