I am trying to standardize (mean = 0, std = 1) one column ('age') in my data frame. Below is my code in Spark (Python):
from pyspark.ml.feature import StandardScaler
from pyspark.ml.feature import VectorAssembler
from pyspark.ml import Pipeline
# Make my 'age' column an assembler type:
age_assembler = VectorAssembler(inputCols=["age"], outputCol="age_feature")
# Create a scaler that takes 'age_feature' as an input column:
scaler = StandardScaler(inputCol="age_feature", outputCol="age_scaled",
                        withStd=True, withMean=True)
# Creating a mini-pipeline for those 2 steps:
age_pipeline = Pipeline(stages=[age_assembler, scaler])
scaled = age_pipeline.fit(sample17)
sample17_scaled = scaled.transform(sample17)
type(sample17_scaled)
It seems to run just fine, and the last line reports the expected type: sample17_scaled: pyspark.sql.dataframe.DataFrame
But when I run
sample17_scaled.printSchema()
it shows that the new column age_scaled is of type 'vector':
|-- age_scaled: vector (nullable = true)
How can I calculate anything using this new column? For example, I can't compute a mean: when I try, Spark complains that the column should be numeric (e.g. 'long') rather than a UDT (user-defined type).
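For reference, the failing aggregation looks roughly like this (paraphrased from memory, so the exact call is an assumption):

from pyspark.sql.functions import mean
# Fails: 'age_scaled' is a VectorUDT, and mean()/avg() only accept numeric columns
sample17_scaled.select(mean("age_scaled")).show()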
Thank you very much!
Just use plain aggregation:
from pyspark.sql.functions import stddev, mean, col

sample17 = spark.createDataFrame([(1, ), (2, ), (3, )]).toDF("age")

# Compute the aggregates once, attach them to every row with a cross join,
# then scale each value:
(sample17
    .select(mean("age").alias("mean_age"), stddev("age").alias("stddev_age"))
    .crossJoin(sample17)
    .withColumn("age_scaled", (col("age") - col("mean_age")) / col("stddev_age"))
    .show())

# +--------+----------+---+----------+
# |mean_age|stddev_age|age|age_scaled|
# +--------+----------+---+----------+
# |     2.0|       1.0|  1|      -1.0|
# |     2.0|       1.0|  2|       0.0|
# |     2.0|       1.0|  3|       1.0|
# +--------+----------+---+----------+
or
mean_age, stddev_age = sample17.select(mean("age"), stddev("age")).first()

# The aggregates are now plain Python floats on the driver:
sample17.withColumn("age_scaled", (col("age") - mean_age) / stddev_age).show()

# +---+----------+
# |age|age_scaled|
# +---+----------+
# |  1|      -1.0|
# |  2|       0.0|
# |  3|       1.0|
# +---+----------+
If you want a Transformer-based solution, you can keep your Pipeline as it is and split the resulting vector column back into a plain numeric column.
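For example, on Spark 3.0+ there is pyspark.ml.functions.vector_to_array for exactly this (a minimal sketch reusing age_pipeline and sample17 from the question; on older versions you would need a UDF instead):

from pyspark.ml.functions import vector_to_array
from pyspark.sql.functions import mean

sample17_scaled = age_pipeline.fit(sample17).transform(sample17)

# 'age_scaled' holds one-element vectors; pull out element 0 as a double
with_scalar = sample17_scaled.withColumn(
    "age_scaled_num", vector_to_array("age_scaled").getItem(0))

# Ordinary aggregations now work on the extracted column:
with_scalar.select(mean("age_scaled_num")).show()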