
How to interpret results of Spark OneHotEncoder

I read the OneHotEncoder (OHE) entry in the Spark docs:

One-hot encoding maps a column of label indices to a column of binary vectors, with at most a single one-value. This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features.

Sadly, they do not give a full explanation of the OHE result, so I ran the given code:

from pyspark.ml.feature import OneHotEncoder, StringIndexer

df = sqlContext.createDataFrame([
    (0, "a"),
    (1, "b"),
    (2, "c"),
    (3, "a"),
    (4, "a"),
    (5, "c")
], ["id", "category"])

stringIndexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
model = stringIndexer.fit(df)
indexed = model.transform(df)

encoder = OneHotEncoder(inputCol="categoryIndex", outputCol="categoryVec")
encoded = encoder.transform(indexed)
encoded.show()

And got the results:

   +---+--------+-------------+-------------+
   | id|category|categoryIndex|  categoryVec|
   +---+--------+-------------+-------------+
   |  0|       a|          0.0|(2,[0],[1.0])|
   |  1|       b|          2.0|    (2,[],[])|
   |  2|       c|          1.0|(2,[1],[1.0])|
   |  3|       a|          0.0|(2,[0],[1.0])|
   |  4|       a|          0.0|(2,[0],[1.0])|
   |  5|       c|          1.0|(2,[1],[1.0])|
   +---+--------+-------------+-------------+

How should I interpret the results of OHE (the last column, categoryVec)?

Maria asked Feb 17 '17 10:02


1 Answer

One-hot encoding transforms the values in categoryIndex into a binary vector in which at most one value is 1. There are three category values, but by default the encoder drops the last category, so the vector has length 2 and the mapping is as follows:

0  -> 10
1  -> 01
2  -> 00

(Why is the mapping like this? See this question about the one-hot encoder dropping the last category.)
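
The mapping above can be reproduced with a small plain-Python sketch. The function name one_hot and its drop_last argument are made up for illustration and are not part of Spark; they only mimic the behaviour of the encoder's dropLast parameter, which defaults to true.

```python
def one_hot(index, num_categories, drop_last=True):
    """Return the one-hot vector for a category index as a list of floats.

    Hypothetical helper, not Spark's API. With drop_last=True the last
    category gets no slot of its own and is encoded as all zeros.
    """
    if not 0 <= index < num_categories:
        raise ValueError("index out of range")
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if index < size:
        vec[index] = 1.0
    return vec


# Three categories, as in the question (indices 0, 1 and 2).
for idx in range(3):
    print(idx, "->", one_hot(idx, 3))
```

With drop_last=False the same helper would return vectors of length 3, one slot per category. Dropping the last slot loses no information, because an all-zero vector can only mean the last category.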

The values in column categoryVec are exactly these, but represented in sparse format, in which the zeros of a vector are not stored or printed. The first value (2) is the length of the vector. The second value is an array listing the indices (zero or more) at which non-zero entries are found. The third value is another array giving the values found at those indices. So (2,[0],[1.0]) means a vector of length 2 with 1.0 at position 0 and 0 elsewhere, and (2,[],[]) means a vector of length 2 with no non-zero entries, i.e. all zeros.
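
To make the sparse notation concrete, here is a minimal plain-Python sketch that expands a (size, indices, values) triple into a dense list. The helper name sparse_to_dense is an assumption for illustration and is not a Spark function.

```python
def sparse_to_dense(size, indices, values):
    """Expand a sparse (size, indices, values) triple into a dense list.

    Hypothetical helper for illustration only. Every position not listed
    in `indices` is implicitly 0.0.
    """
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


# The three distinct categoryVec values from the question's output.
rows = [(2, [0], [1.0]), (2, [], []), (2, [1], [1.0])]
for size, indices, values in rows:
    print((size, indices, values), "->", sparse_to_dense(size, indices, values))
```

In PySpark itself, a SparseVector has a toArray() method that returns the dense form, so a helper like this should not be needed in practice.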

See: https://spark.apache.org/docs/latest/mllib-data-types.html#local-vector

moe answered Oct 08 '22 09:10