 

How to standardize ONE column in Spark using StandardScaler?

I am trying to standardize (mean = 0, std = 1) one column ('age') in my data frame. Below is my code in Spark (Python):

from pyspark.ml.feature import StandardScaler
from pyspark.ml.feature import VectorAssembler
from pyspark.ml import Pipeline

# Wrap the 'age' column in a one-element vector so the scaler can consume it:
age_assembler = VectorAssembler(inputCols=['age'], outputCol="age_feature")

# Create a scaler that takes 'age_feature' as an input column:
scaler = StandardScaler(inputCol="age_feature", outputCol="age_scaled",
                        withStd=True, withMean=True)

# Creating a mini-pipeline for those 2 steps:
age_pipeline = Pipeline(stages=[age_assembler, scaler])
scaled = age_pipeline.fit(sample17)
sample17_scaled = scaled.transform(sample17)
type(sample17_scaled)

It seems to run just fine, and the very last line produces: "sample17_scaled:pyspark.sql.dataframe.DataFrame"

But when I run

sample17_scaled.printSchema()

it shows that the new column age_scaled is of type vector: |-- age_scaled: vector (nullable = true)

How can I calculate anything using this new column? For example, I can't calculate a mean: when I try, the error says the column should be 'long' and not a UDT.
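
For reference, the kind of aggregation that fails here looks roughly like this (the question does not show the exact call, so this snippet is only an illustration):

from pyspark.sql.functions import mean

sample17_scaled.select(mean("age_scaled")).show()
# Raises an analysis error: mean() needs a numeric column, but age_scaled is a Spark ML vector (a UDT).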

Thank you very much!

asked Dec 03 '17 by user3245256



1 Answer

Just use plain aggregation:

from pyspark.sql.functions import stddev, mean, col

sample17 = spark.createDataFrame([(1, ), (2, ), (3, )]).toDF("age")

(sample17
  .select(mean("age").alias("mean_age"), stddev("age").alias("stddev_age"))
  .crossJoin(sample17)
  .withColumn("age_scaled" , (col("age") - col("mean_age")) / col("stddev_age")))

# +--------+----------+---+----------+
# |mean_age|stddev_age|age|age_scaled|
# +--------+----------+---+----------+
# |     2.0|       1.0|  1|      -1.0|
# |     2.0|       1.0|  2|       0.0|
# |     2.0|       1.0|  3|       1.0|
# +--------+----------+---+----------+

or

mean_age, stddev_age = sample17.select(mean("age"), stddev("age")).first()
sample17.withColumn("age_scaled", (col("age") - mean_age) / stddev_age)

# +---+----------+
# |age|age_scaled|
# +---+----------+
# |  1|      -1.0|
# |  2|       0.0|
# |  3|       1.0|
# +---+----------+

If you want to keep the Pipeline / Transformer approach, you can split the vector back into plain numeric columns.
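
Here is a minimal sketch of that idea, assuming Spark 3.0+ (where pyspark.ml.functions.vector_to_array is available) and reusing sample17_scaled from the question; unpacked and age_scaled_num are just illustrative names:

from pyspark.ml.functions import vector_to_array
from pyspark.sql.functions import col, mean

# Unpack the single-element 'age_scaled' vector into a plain double column
# that ordinary SQL aggregations can work with:
unpacked = sample17_scaled.withColumn(
    "age_scaled_num", vector_to_array(col("age_scaled"))[0])

unpacked.select(mean("age_scaled_num")).show()
# The mean of a standardized column should come out as approximately 0.0.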

answered Nov 14 '22 by Alper t. Turker