How do I handle categorical data with spark-ml and not spark-mllib?
Though the documentation is not very clear, it seems that classifiers, e.g. RandomForestClassifier, LogisticRegression, have a featuresCol argument, which specifies the name of the column of features in the DataFrame, and a labelCol argument, which specifies the name of the column of labeled classes in the DataFrame.
Obviously I want to use more than one feature in my prediction, so I tried using the VectorAssembler to put all my features in a single vector under featuresCol.

However, the VectorAssembler only accepts numeric types, boolean type, and vector type (according to the Spark website), so I can't put strings in my features vector.

How should I proceed?
You can cast a string column type in a Spark DataFrame to a numerical data type using the cast function. For example, you can read in a CSV file as a DataFrame, cast the default string datatypes into integer and double, and overwrite the original DataFrame.
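A minimal sketch of that approach, assuming a CSV with hypothetical string columns "age" and "salary" (the file name and column names are illustrative only):

import org.apache.spark.sql.types.{DoubleType, IntegerType}

// read the CSV; every column comes in as a string by default
val raw = spark.read.option("header", "true").csv("data.csv")

// cast the string columns to numeric types, overwriting them in place
val typed = raw
  .withColumn("age", raw("age").cast(IntegerType))      // string -> integer
  .withColumn("salary", raw("salary").cast(DoubleType)) // string -> double

Casting only makes sense when the strings are actually numbers; for true categorical strings like "a"/"b"/"c", use one of the encodings below.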
Another option is to encode each category by how it relates to the target, i.e. probability-ratio encoding (a sketch follows this list):
1) Using the categorical variable, evaluate the probability of the Target variable (where the output is True or 1).
2) Calculate the probability of the Target variable having a False or 0 output.
3) Calculate the probability ratio, i.e. P(True or 1) / P(False or 0).
4) Replace the category with the probability ratio.
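A minimal sketch of those steps, assuming a DataFrame df with a string "category" column and a 0/1 "target" column (both column names are hypothetical):

import org.apache.spark.sql.functions._

// the average of a 0/1 column per group is exactly P(target = 1) per category
val stats = df.groupBy("category")
  .agg(avg("target").as("pTrue"))
  // P(1) / P(0); note a category whose targets are all 1 yields Infinity here
  .withColumn("ratio", col("pTrue") / (lit(1.0) - col("pTrue")))

// replace the category with its probability ratio via a join on the category
val encoded = df.join(stats.select("category", "ratio"), "category")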
Since categorical features cannot be used directly in most machine learning algorithms, they need to be transformed into numerical features. While numerous techniques exist for this, the most common is one-hot encoding, shown in detail in the answers below.
I just wanted to complete Holden's answer.
Since Spark 2.3.0, OneHotEncoder has been deprecated and it will be removed in 3.0.0. Please use OneHotEncoderEstimator instead. (In Spark 3.0.0, OneHotEncoderEstimator was in turn renamed to OneHotEncoder, keeping the new estimator-based behavior.)
In Scala:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{OneHotEncoderEstimator, StringIndexer}

val df = Seq(
  (0, "a", 1), (1, "b", 2), (2, "c", 3),
  (3, "a", 4), (4, "a", 4), (5, "c", 3)
).toDF("id", "category1", "category2")

val indexer = new StringIndexer()
  .setInputCol("category1")
  .setOutputCol("category1Index")

val encoder = new OneHotEncoderEstimator()
  .setInputCols(Array(indexer.getOutputCol, "category2"))
  .setOutputCols(Array("category1Vec", "category2Vec"))

val pipeline = new Pipeline().setStages(Array(indexer, encoder))

pipeline.fit(df).transform(df).show
// +---+---------+---------+--------------+-------------+-------------+
// | id|category1|category2|category1Index| category1Vec| category2Vec|
// +---+---------+---------+--------------+-------------+-------------+
// |  0|        a|        1|           0.0|(2,[0],[1.0])|(4,[1],[1.0])|
// |  1|        b|        2|           2.0|    (2,[],[])|(4,[2],[1.0])|
// |  2|        c|        3|           1.0|(2,[1],[1.0])|(4,[3],[1.0])|
// |  3|        a|        4|           0.0|(2,[0],[1.0])|    (4,[],[])|
// |  4|        a|        4|           0.0|(2,[0],[1.0])|    (4,[],[])|
// |  5|        c|        3|           1.0|(2,[1],[1.0])|(4,[3],[1.0])|
// +---+---------+---------+--------------+-------------+-------------+
In Python:
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoderEstimator

df = spark.createDataFrame(
    [(0, "a", 1), (1, "b", 2), (2, "c", 3),
     (3, "a", 4), (4, "a", 4), (5, "c", 3)],
    ["id", "category1", "category2"]
)

indexer = StringIndexer(inputCol="category1", outputCol="category1Index")

inputs = [indexer.getOutputCol(), "category2"]

encoder = OneHotEncoderEstimator(inputCols=inputs,
                                 outputCols=["categoryVec1", "categoryVec2"])

pipeline = Pipeline(stages=[indexer, encoder])

pipeline.fit(df).transform(df).show()
# +---+---------+---------+--------------+-------------+-------------+
# | id|category1|category2|category1Index| categoryVec1| categoryVec2|
# +---+---------+---------+--------------+-------------+-------------+
# |  0|        a|        1|           0.0|(2,[0],[1.0])|(4,[1],[1.0])|
# |  1|        b|        2|           2.0|    (2,[],[])|(4,[2],[1.0])|
# |  2|        c|        3|           1.0|(2,[1],[1.0])|(4,[3],[1.0])|
# |  3|        a|        4|           0.0|(2,[0],[1.0])|    (4,[],[])|
# |  4|        a|        4|           0.0|(2,[0],[1.0])|    (4,[],[])|
# |  5|        c|        3|           1.0|(2,[1],[1.0])|(4,[3],[1.0])|
# +---+---------+---------+--------------+-------------+-------------+
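To answer the original question end-to-end, the encoded vector columns can then be fed to a VectorAssembler to build the single features column a classifier expects. A minimal sketch in Scala, reusing the indexer and encoder stages from the Scala example above (appending the assembler stage is the only addition here):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.VectorAssembler

// one-hot vectors are legal VectorAssembler inputs, unlike raw string columns
val assembler = new VectorAssembler()
  .setInputCols(Array("category1Vec", "category2Vec"))
  .setOutputCol("features")

// append the assembler to the existing indexer/encoder stages
val featurePipeline = new Pipeline()
  .setStages(Array(indexer, encoder, assembler))

featurePipeline.fit(df).transform(df).select("id", "features").show(false)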
Since Spark 1.4.0, MLlib also supplies the OneHotEncoder feature, which maps a column of label indices to a column of binary vectors with at most a single one-value. This encoding allows algorithms that expect continuous features, such as Logistic Regression, to use categorical features.
Let's consider the following DataFrame:
val df = Seq(
  (0, "a"), (1, "b"), (2, "c"),
  (3, "a"), (4, "a"), (5, "c")
).toDF("id", "category")
The first step would be to create the indexed DataFrame with the StringIndexer:
import org.apache.spark.ml.feature.StringIndexer

val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .fit(df)

val indexed = indexer.transform(df)

indexed.show
// +---+--------+-------------+
// | id|category|categoryIndex|
// +---+--------+-------------+
// |  0|       a|          0.0|
// |  1|       b|          2.0|
// |  2|       c|          1.0|
// |  3|       a|          0.0|
// |  4|       a|          0.0|
// |  5|       c|          1.0|
// +---+--------+-------------+
You can then encode the categoryIndex with OneHotEncoder:
import org.apache.spark.ml.feature.OneHotEncoder

val encoder = new OneHotEncoder()
  .setInputCol("categoryIndex")
  .setOutputCol("categoryVec")

val encoded = encoder.transform(indexed)

encoded.select("id", "categoryVec").show
// +---+-------------+
// | id|  categoryVec|
// +---+-------------+
// |  0|(2,[0],[1.0])|
// |  1|    (2,[],[])|
// |  2|(2,[1],[1.0])|
// |  3|(2,[0],[1.0])|
// |  4|(2,[0],[1.0])|
// |  5|(2,[1],[1.0])|
// +---+-------------+
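Note that "b" maps to the all-zero vector (2,[],[]) because OneHotEncoder drops the last category by default, to avoid linear dependence between the encoded columns. As an aside (not part of the answer above), you can keep an explicit slot for every category by disabling that behavior:

// keep a vector position for every category, including the last one
val encoderAll = new OneHotEncoder()
  .setInputCol("categoryIndex")
  .setOutputCol("categoryVec")
  .setDropLast(false)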