How do I handle categorical data with spark-ml and not spark-mllib?
Though the documentation is not very clear, it seems that classifiers, e.g. RandomForestClassifier, LogisticRegression, have a featuresCol argument, which specifies the name of the column of features in the DataFrame, and a labelCol argument, which specifies the name of the column of labeled classes in the DataFrame.
Obviously I want to use more than one feature in my prediction, so I tried using the VectorAssembler to put all my features in a single vector under featuresCol.

However, the VectorAssembler only accepts numeric types, boolean type, and vector type (according to the Spark website), so I can't put strings in my features vector.

How should I proceed?
You can cast a string column type in a Spark DataFrame to a numerical data type using the cast function. For example, you can read in a CSV file as a DataFrame, cast the default string datatypes into integer and double, and overwrite the original DataFrame.
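A minimal sketch of that approach, assuming a CSV with hypothetical string columns "age" and "salary" (the file name and column names are illustrative only):

import org.apache.spark.sql.types.{DoubleType, IntegerType}

// read the CSV; every column comes in as a string by default
val raw = spark.read.option("header", "true").csv("data.csv")

// cast the string columns to numeric types, overwriting them in place
val typed = raw
  .withColumn("age", raw("age").cast(IntegerType))      // string -> integer
  .withColumn("salary", raw("salary").cast(DoubleType)) // string -> double

Casting only makes sense when the strings are actually numbers; for true categorical strings like "a"/"b"/"c", use one of the encodings below.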
Another option is to encode each category by how it relates to the target, i.e. probability-ratio encoding (a sketch follows this list):
1) Using the categorical variable, evaluate the probability of the Target variable (where the output is True or 1).
2) Calculate the probability of the Target variable having a False or 0 output.
3) Calculate the probability ratio, i.e. P(True or 1) / P(False or 0).
4) Replace the category with the probability ratio.
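A minimal sketch of those steps, assuming a DataFrame df with a string "category" column and a 0/1 "target" column (both column names are hypothetical):

import org.apache.spark.sql.functions._

// the average of a 0/1 column per group is exactly P(target = 1) per category
val stats = df.groupBy("category")
  .agg(avg("target").as("pTrue"))
  // P(1) / P(0); note a category whose targets are all 1 yields Infinity here
  .withColumn("ratio", col("pTrue") / (lit(1.0) - col("pTrue")))

// replace the category with its probability ratio via a join on the category
val encoded = df.join(stats.select("category", "ratio"), "category")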
Since categorical features cannot be used directly in most machine learning algorithms, they need to be transformed into numerical features. While numerous techniques exist for this, the most common is one-hot encoding, shown in detail in the answers below.
I just wanted to complete Holden's answer.
Since Spark 2.3.0, OneHotEncoder has been deprecated and it will be removed in 3.0.0. Please use OneHotEncoderEstimator instead. (In Spark 3.0.0, OneHotEncoderEstimator was in turn renamed to OneHotEncoder, keeping the new estimator-based behavior.)
In Scala:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{OneHotEncoderEstimator, StringIndexer}

val df = Seq(
  (0, "a", 1), (1, "b", 2), (2, "c", 3),
  (3, "a", 4), (4, "a", 4), (5, "c", 3)
).toDF("id", "category1", "category2")

val indexer = new StringIndexer()
  .setInputCol("category1")
  .setOutputCol("category1Index")

val encoder = new OneHotEncoderEstimator()
  .setInputCols(Array(indexer.getOutputCol, "category2"))
  .setOutputCols(Array("category1Vec", "category2Vec"))

val pipeline = new Pipeline().setStages(Array(indexer, encoder))

pipeline.fit(df).transform(df).show
// +---+---------+---------+--------------+-------------+-------------+
// | id|category1|category2|category1Index| category1Vec| category2Vec|
// +---+---------+---------+--------------+-------------+-------------+
// |  0|        a|        1|           0.0|(2,[0],[1.0])|(4,[1],[1.0])|
// |  1|        b|        2|           2.0|    (2,[],[])|(4,[2],[1.0])|
// |  2|        c|        3|           1.0|(2,[1],[1.0])|(4,[3],[1.0])|
// |  3|        a|        4|           0.0|(2,[0],[1.0])|    (4,[],[])|
// |  4|        a|        4|           0.0|(2,[0],[1.0])|    (4,[],[])|
// |  5|        c|        3|           1.0|(2,[1],[1.0])|(4,[3],[1.0])|
// +---+---------+---------+--------------+-------------+-------------+
In Python:
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoderEstimator

df = spark.createDataFrame(
    [(0, "a", 1), (1, "b", 2), (2, "c", 3),
     (3, "a", 4), (4, "a", 4), (5, "c", 3)],
    ["id", "category1", "category2"]
)

indexer = StringIndexer(inputCol="category1", outputCol="category1Index")

inputs = [indexer.getOutputCol(), "category2"]

encoder = OneHotEncoderEstimator(inputCols=inputs,
                                 outputCols=["categoryVec1", "categoryVec2"])

pipeline = Pipeline(stages=[indexer, encoder])

pipeline.fit(df).transform(df).show()
# +---+---------+---------+--------------+-------------+-------------+
# | id|category1|category2|category1Index| categoryVec1| categoryVec2|
# +---+---------+---------+--------------+-------------+-------------+
# |  0|        a|        1|           0.0|(2,[0],[1.0])|(4,[1],[1.0])|
# |  1|        b|        2|           2.0|    (2,[],[])|(4,[2],[1.0])|
# |  2|        c|        3|           1.0|(2,[1],[1.0])|(4,[3],[1.0])|
# |  3|        a|        4|           0.0|(2,[0],[1.0])|    (4,[],[])|
# |  4|        a|        4|           0.0|(2,[0],[1.0])|    (4,[],[])|
# |  5|        c|        3|           1.0|(2,[1],[1.0])|(4,[3],[1.0])|
# +---+---------+---------+--------------+-------------+-------------+
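To answer the original question end-to-end, the encoded vector columns can then be fed to a VectorAssembler to build the single features column a classifier expects. A minimal sketch in Scala, reusing the indexer and encoder stages from the Scala example above (appending the assembler stage is the only addition here):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.VectorAssembler

// one-hot vectors are legal VectorAssembler inputs, unlike raw string columns
val assembler = new VectorAssembler()
  .setInputCols(Array("category1Vec", "category2Vec"))
  .setOutputCol("features")

// append the assembler to the existing indexer/encoder stages
val featurePipeline = new Pipeline()
  .setStages(Array(indexer, encoder, assembler))

featurePipeline.fit(df).transform(df).select("id", "features").show(false)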
Since Spark 1.4.0, MLlib also supplies the OneHotEncoder feature, which maps a column of label indices to a column of binary vectors with at most a single one-value. This encoding allows algorithms that expect continuous features, such as Logistic Regression, to use categorical features.
Let's consider the following DataFrame:
val df = Seq(
  (0, "a"), (1, "b"), (2, "c"),
  (3, "a"), (4, "a"), (5, "c")
).toDF("id", "category")
The first step would be to create the indexed DataFrame with the StringIndexer:
import org.apache.spark.ml.feature.StringIndexer

val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .fit(df)

val indexed = indexer.transform(df)

indexed.show
// +---+--------+-------------+
// | id|category|categoryIndex|
// +---+--------+-------------+
// |  0|       a|          0.0|
// |  1|       b|          2.0|
// |  2|       c|          1.0|
// |  3|       a|          0.0|
// |  4|       a|          0.0|
// |  5|       c|          1.0|
// +---+--------+-------------+
You can then encode the categoryIndex with OneHotEncoder:
import org.apache.spark.ml.feature.OneHotEncoder

val encoder = new OneHotEncoder()
  .setInputCol("categoryIndex")
  .setOutputCol("categoryVec")

val encoded = encoder.transform(indexed)

encoded.select("id", "categoryVec").show
// +---+-------------+
// | id|  categoryVec|
// +---+-------------+
// |  0|(2,[0],[1.0])|
// |  1|    (2,[],[])|
// |  2|(2,[1],[1.0])|
// |  3|(2,[0],[1.0])|
// |  4|(2,[0],[1.0])|
// |  5|(2,[1],[1.0])|
// +---+-------------+
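Note that "b" maps to the all-zero vector (2,[],[]) because OneHotEncoder drops the last category by default, to avoid linear dependence between the encoded columns. As an aside (not part of the answer above), you can keep an explicit slot for every category by disabling that behavior:

// keep a vector position for every category, including the last one
val encoderAll = new OneHotEncoder()
  .setInputCol("categoryIndex")
  .setOutputCol("categoryVec")
  .setDropLast(false)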