How to interpret results of Spark OneHotEncoder

I read the OneHotEncoder (OHE) entry in the Spark docs,

One-hot encoding maps a column of label indices to a column of binary vectors, with at most a single one-value. This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features.

but sadly they do not fully explain the OHE result. So I ran the given code:

from pyspark.ml.feature import OneHotEncoder, StringIndexer

df = sqlContext.createDataFrame([
    (0, "a"),
    (1, "b"),
    (2, "c"),
    (3, "a"),
    (4, "a"),
    (5, "c")
], ["id", "category"])

stringIndexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
model = stringIndexer.fit(df)
indexed = model.transform(df)

encoder = OneHotEncoder(inputCol="categoryIndex", outputCol="categoryVec")
encoded = encoder.transform(indexed)
encoded.show()

And got the results:

   +---+--------+-------------+-------------+
   | id|category|categoryIndex|  categoryVec|
   +---+--------+-------------+-------------+
   |  0|       a|          0.0|(2,[0],[1.0])|
   |  1|       b|          2.0|    (2,[],[])|
   |  2|       c|          1.0|(2,[1],[1.0])|
   |  3|       a|          0.0|(2,[0],[1.0])|
   |  4|       a|          0.0|(2,[0],[1.0])|
   |  5|       c|          1.0|(2,[1],[1.0])|
   +---+--------+-------------+-------------+

How should I interpret the OHE results (the last column)?

537

asked Feb 17 '17 10:02

Maria

1 Answer

One-hot encoding transforms the values in categoryIndex into a binary vector in which at most one entry is 1. Since there are three distinct values and the encoder drops the last category by default, the vector has length 2 and the mapping is as follows:

0  -> 10
1  -> 01
2  -> 00

(Why is the mapping like this? See this question about the one-hot encoder dropping the last category.)
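The mapping above can be reproduced with a few lines of plain Python. This is only a minimal sketch of the "drop the last category" idea, not Spark's implementation; the function name one_hot_drop_last is made up for illustration.

```python
def one_hot_drop_last(index, num_categories):
    """Encode a category index as a list of length num_categories - 1.

    Hypothetical helper for illustration: the last category gets no
    slot of its own and is represented by the all-zeros vector.
    """
    index = int(index)  # Spark stores indices as doubles, e.g. 2.0
    if not 0 <= index < num_categories:
        raise ValueError("index out of range")
    vec = [0.0] * (num_categories - 1)
    if index < num_categories - 1:
        vec[index] = 1.0
    return vec


for idx in (0.0, 1.0, 2.0):
    print(idx, "->", one_hot_drop_last(idx, 3))
# 0.0 -> [1.0, 0.0]
# 1.0 -> [0.0, 1.0]
# 2.0 -> [0.0, 0.0]
```

With three categories, two slots are enough because the third category is implied whenever both slots are zero.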

The values in column categoryVec are exactly these, but represented in sparse format. In this format the zeros of a vector are not stored. The first value (2) is the length of the vector, the second value is an array listing zero or more indices where non-zero entries are found, and the third value is another array giving the values at those indices. So (2,[0],[1.0]) means a vector of length 2 with 1.0 at position 0 and 0 elsewhere, and (2,[],[]) is a vector of length 2 containing only zeros.

See: https://spark.apache.org/docs/latest/mllib-data-types.html#local-vector
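To make the sparse notation concrete, here is a small plain-Python sketch that expands a (size, indices, values) triple into an ordinary list. The helper name sparse_to_dense is an assumption for illustration; in PySpark itself, calling toArray() on a SparseVector gives the dense form.

```python
def sparse_to_dense(size, indices, values):
    """Expand a sparse (size, indices, values) triple into a dense list.

    Hypothetical helper for illustration; it does not depend on Spark.
    """
    dense = [0.0] * size
    # Each index names a position; the matching value goes there.
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


print(sparse_to_dense(2, [0], [1.0]))  # [1.0, 0.0]  (category a)
print(sparse_to_dense(2, [1], [1.0]))  # [0.0, 1.0]  (category c)
print(sparse_to_dense(2, [], []))      # [0.0, 0.0]  (category b)
```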

132

answered Oct 08 '22 09:10

moe

Tags: python, apache-spark, one-hot-encoding, pyspark
