
How to handle categorical features with spark-ml?

How do I handle categorical data with spark-ml and not spark-mllib?

Though the documentation is not very clear, it seems that classifiers, e.g. RandomForestClassifier and LogisticRegression, have a featuresCol argument, which specifies the name of the column of features in the DataFrame, and a labelCol argument, which specifies the name of the column of labeled classes in the DataFrame.

Obviously I want to use more than one feature in my prediction, so I tried using the VectorAssembler to put all my features in a single vector under featuresCol.

However, the VectorAssembler only accepts numeric types, boolean type, and vector type (according to the Spark website), so I can't put strings in my features vector.

How should I proceed?

asked Aug 28 '15 by Rainmaker

1 Answer

I just wanted to complete Holden's answer.

Since Spark 2.3.0, OneHotEncoder has been deprecated and will be removed in 3.0.0; use OneHotEncoderEstimator instead. (In Spark 3.0, OneHotEncoderEstimator was itself renamed back to OneHotEncoder, keeping the multi-column API.)

In Scala:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{OneHotEncoderEstimator, StringIndexer}

val df = Seq((0, "a", 1), (1, "b", 2), (2, "c", 3), (3, "a", 4), (4, "a", 4), (5, "c", 3))
  .toDF("id", "category1", "category2")

val indexer = new StringIndexer()
  .setInputCol("category1")
  .setOutputCol("category1Index")
val encoder = new OneHotEncoderEstimator()
  .setInputCols(Array(indexer.getOutputCol, "category2"))
  .setOutputCols(Array("category1Vec", "category2Vec"))

val pipeline = new Pipeline().setStages(Array(indexer, encoder))

pipeline.fit(df).transform(df).show
// +---+---------+---------+--------------+-------------+-------------+
// | id|category1|category2|category1Index| category1Vec| category2Vec|
// +---+---------+---------+--------------+-------------+-------------+
// |  0|        a|        1|           0.0|(2,[0],[1.0])|(4,[1],[1.0])|
// |  1|        b|        2|           2.0|    (2,[],[])|(4,[2],[1.0])|
// |  2|        c|        3|           1.0|(2,[1],[1.0])|(4,[3],[1.0])|
// |  3|        a|        4|           0.0|(2,[0],[1.0])|    (4,[],[])|
// |  4|        a|        4|           0.0|(2,[0],[1.0])|    (4,[],[])|
// |  5|        c|        3|           1.0|(2,[1],[1.0])|(4,[3],[1.0])|
// +---+---------+---------+--------------+-------------+-------------+

In Python:

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoderEstimator

df = spark.createDataFrame(
    [(0, "a", 1), (1, "b", 2), (2, "c", 3), (3, "a", 4), (4, "a", 4), (5, "c", 3)],
    ["id", "category1", "category2"])

indexer = StringIndexer(inputCol="category1", outputCol="category1Index")
inputs = [indexer.getOutputCol(), "category2"]
encoder = OneHotEncoderEstimator(inputCols=inputs,
                                 outputCols=["categoryVec1", "categoryVec2"])
pipeline = Pipeline(stages=[indexer, encoder])
pipeline.fit(df).transform(df).show()
# +---+---------+---------+--------------+-------------+-------------+
# | id|category1|category2|category1Index| categoryVec1| categoryVec2|
# +---+---------+---------+--------------+-------------+-------------+
# |  0|        a|        1|           0.0|(2,[0],[1.0])|(4,[1],[1.0])|
# |  1|        b|        2|           2.0|    (2,[],[])|(4,[2],[1.0])|
# |  2|        c|        3|           1.0|(2,[1],[1.0])|(4,[3],[1.0])|
# |  3|        a|        4|           0.0|(2,[0],[1.0])|    (4,[],[])|
# |  4|        a|        4|           0.0|(2,[0],[1.0])|    (4,[],[])|
# |  5|        c|        3|           1.0|(2,[1],[1.0])|(4,[3],[1.0])|
# +---+---------+---------+--------------+-------------+-------------+
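A note on reading the output: Spark prints one-hot vectors in sparse form, (size, [indices], [values]). StringIndexer assigns indices by descending frequency, and the encoder drops the last category by default (dropLast), which is why b encodes as the all-zeros vector (2,[],[]). A minimal pure-Python sketch of that index-then-encode behavior (illustrative only, not Spark code):

```python
from collections import Counter

def string_index(values):
    # Mirror StringIndexer's default "frequencyDesc" ordering:
    # most frequent category gets index 0, ties broken alphabetically.
    counts = Counter(values)
    ordered = sorted(counts, key=lambda c: (-counts[c], c))
    return {c: i for i, c in enumerate(ordered)}

def one_hot(index, num_categories, drop_last=True):
    # With drop_last, the last index maps to the all-zeros vector,
    # so n categories need only n-1 slots (avoids linear dependence).
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if index < size:
        vec[index] = 1.0
    return vec

cats = ["a", "b", "c", "a", "a", "c"]
idx = string_index(cats)                    # {'a': 0, 'c': 1, 'b': 2}
encoded = [one_hot(idx[c], len(idx)) for c in cats]
```

With the data above, a (3 occurrences) gets index 0 and encodes to [1.0, 0.0], matching (2,[0],[1.0]) in the table, while b (index 2, the dropped category) encodes to [0.0, 0.0], matching (2,[],[]).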

Since Spark 1.4.0, Spark ML also supplies a OneHotEncoder feature, which maps a column of label indices to a column of binary vectors with at most a single one-value.

This encoding allows algorithms that expect continuous features, such as Logistic Regression, to use categorical features.
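To see why the intermediate indices should not be fed to such models directly: an index encoding imposes an artificial ordering and spacing on the categories (a=0.0 < c=1.0 < b=2.0), which a linear model would treat as meaningful, whereas one-hot vectors give each category its own independent coefficient. A tiny pure-Python illustration (made-up weights, not Spark code):

```python
# A linear model scores a row as the dot product of features and weights.
def score(features, weights):
    return sum(f * w for f, w in zip(features, weights))

# Index encoding: a single weight forces the categories onto a line,
# so score(b) - score(c) must equal score(c) - score(a).
w = [0.5]
scores_indexed = {c: score([i], w) for c, i in {"a": 0.0, "c": 1.0, "b": 2.0}.items()}

# One-hot encoding: one weight per category, no ordering implied.
weights = [0.3, -1.2, 0.9]  # hypothetical per-category effects
one_hot_rows = {"a": [1, 0, 0], "c": [0, 1, 0], "b": [0, 0, 1]}
scores_onehot = {c: score(v, weights) for c, v in one_hot_rows.items()}
```

The indexed scores are constrained to be equally spaced, while the one-hot scores can take any three values the training data supports.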

Let's consider the following DataFrame:

val df = Seq((0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c"))
  .toDF("id", "category")

The first step is to create the indexed DataFrame with the StringIndexer:

import org.apache.spark.ml.feature.StringIndexer

val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .fit(df)

val indexed = indexer.transform(df)

indexed.show
// +---+--------+-------------+
// | id|category|categoryIndex|
// +---+--------+-------------+
// |  0|       a|          0.0|
// |  1|       b|          2.0|
// |  2|       c|          1.0|
// |  3|       a|          0.0|
// |  4|       a|          0.0|
// |  5|       c|          1.0|
// +---+--------+-------------+

You can then encode the categoryIndex with OneHotEncoder:

import org.apache.spark.ml.feature.OneHotEncoder

val encoder = new OneHotEncoder()
  .setInputCol("categoryIndex")
  .setOutputCol("categoryVec")

val encoded = encoder.transform(indexed)

encoded.select("id", "categoryVec").show
// +---+-------------+
// | id|  categoryVec|
// +---+-------------+
// |  0|(2,[0],[1.0])|
// |  1|    (2,[],[])|
// |  2|(2,[1],[1.0])|
// |  3|(2,[0],[1.0])|
// |  4|(2,[0],[1.0])|
// |  5|(2,[1],[1.0])|
// +---+-------------+
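Tying this back to the original question: once every categorical column is encoded as a vector, VectorAssembler can accept it, because its job is simply to concatenate numeric columns and vector columns, in order, into the single features vector that featuresCol points at. Conceptually (pure-Python sketch of the concatenation, not Spark code; the numeric column is hypothetical):

```python
def assemble(*cols):
    # What VectorAssembler does per row, conceptually: flatten scalar
    # columns and vector columns, in the given order, into one flat
    # feature vector.
    features = []
    for col in cols:
        if isinstance(col, (list, tuple)):
            features.extend(col)   # a one-hot (or other) vector column
        else:
            features.append(float(col))  # a plain numeric column
    return features

# Row id=0: a hypothetical numeric column (42.0) plus the one-hot
# categoryVec (2,[0],[1.0]), i.e. [1.0, 0.0] in dense form.
row_features = assemble(42.0, [1.0, 0.0])
```

In the real pipeline this means: StringIndexer, then OneHotEncoder, then a VectorAssembler whose inputCols lists categoryVec alongside the numeric columns, with the classifier's featuresCol set to the assembler's output column.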
answered Oct 05 '22 by eliasah