Why does Spark's OneHotEncoder drop the last category by default?

Tags: machine-learning, apache-spark, one-hot-encoding, pyspark, bigdata

I would like to understand the rationale behind Spark's OneHotEncoder dropping the last category by default.

For example:

>>> from pyspark.ml.feature import StringIndexer, OneHotEncoder
>>> fd = spark.createDataFrame([(1.0, "a"), (1.5, "a"), (10.0, "b"), (3.2, "c")], ["x", "c"])
>>> ss = StringIndexer(inputCol="c", outputCol="c_idx")
>>> ff = ss.fit(fd).transform(fd)
>>> ff.show()
+----+---+-----+
|   x|  c|c_idx|
+----+---+-----+
| 1.0|  a|  0.0|
| 1.5|  a|  0.0|
|10.0|  b|  1.0|
| 3.2|  c|  2.0|
+----+---+-----+

By default, the OneHotEncoder will drop the last category:

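>>> oe = OneHotEncoder(inputCol="c_idx", outputCol="c_idx_vec")
>>> # Spark 1.x/2.x API; in Spark 3.0+ OneHotEncoder is an Estimator, so use oe.fit(ff).transform(ff)
>>> fe = oe.transform(ff)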
>>> fe.show()
+----+---+-----+-------------+
|   x|  c|c_idx|    c_idx_vec|
+----+---+-----+-------------+
| 1.0|  a|  0.0|(2,[0],[1.0])|
| 1.5|  a|  0.0|(2,[0],[1.0])|
|10.0|  b|  1.0|(2,[1],[1.0])|
| 3.2|  c|  2.0|    (2,[],[])|
+----+---+-----+-------------+
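The vectors above are printed in sparse form, (size, [indices], [values]). To make it explicit that category "c" ends up as an all-zero vector, here is a minimal plain-Python sketch that expands that notation into dense lists. It is not Spark code, and `to_dense` is a hypothetical helper named only for this illustration:

```python
def to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse triple into a dense list."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

# The three distinct vectors from the table above (dropLast left at its default).
a = to_dense(2, [0], [1.0])  # category "a", index 0.0
b = to_dense(2, [1], [1.0])  # category "b", index 1.0
c = to_dense(2, [], [])      # category "c", index 2.0 (the dropped category)

print(a, b, c)  # [1.0, 0.0] [0.0, 1.0] [0.0, 0.0]
```

So the dropped category is still represented: it is the one row with no non-zero entry.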

Of course, this behavior can be changed:

>>> oe.setDropLast(False)
>>> fl = oe.transform(ff)
>>> fl.show()
+----+---+-----+-------------+
|   x|  c|c_idx|    c_idx_vec|
+----+---+-----+-------------+
| 1.0|  a|  0.0|(3,[0],[1.0])|
| 1.5|  a|  0.0|(3,[0],[1.0])|
|10.0|  b|  1.0|(3,[1],[1.0])|
| 3.2|  c|  2.0|(3,[2],[1.0])|
+----+---+-----+-------------+

Questions:

In what case is the default behavior desirable?
What issues might be overlooked by blindly calling setDropLast(False)?
What do the authors mean by the following statement in the documentation?

The last category is not included by default (configurable via dropLast) because it makes the vector entries sum up to one, and hence linearly dependent.


asked Sep 14 '16 21:09

Corey

1 Answer

According to the documentation, it is to keep the columns linearly independent:

A one-hot encoder that maps a column of category indices to a column of binary vectors, with at most a single one-value per row that indicates the input category index. For example with 5 categories, an input value of 2.0 would map to an output vector of [0.0, 0.0, 1.0, 0.0]. The last category is not included by default (configurable via OneHotEncoder.dropLast) because it makes the vector entries sum up to one, and hence linearly dependent. So an input value of 4.0 maps to [0.0, 0.0, 0.0, 0.0]. Note that this is different from scikit-learn's OneHotEncoder, which keeps all categories. The output vectors are sparse.

https://spark.apache.org/docs/1.5.2/api/java/org/apache/spark/ml/feature/OneHotEncoder.html
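To see what "linearly dependent" means in practice: if every category gets its own column, the entries in each row sum to one, so the one-hot columns add up to exactly the intercept column of a linear model. The sketch below is plain Python, not Spark code; `rank` and `one_hot` are hypothetical helpers written for this illustration, and the intercept column is an assumption about the downstream model. It builds both design matrices for the c_idx values in the question and compares rank against column count:

```python
from fractions import Fraction

def rank(rows):
    """Rank of a matrix via Gaussian elimination with exact fractions."""
    m = [[Fraction(x) for x in row] for row in rows]
    r = 0
    n_cols = len(m[0]) if m else 0
    for col in range(n_cols):
        # Find a row at or below r with a non-zero entry in this column.
        pivot = next((i for i in range(r, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        # Eliminate this column from every row below the pivot row.
        for i in range(r + 1, len(m)):
            factor = m[i][col] / m[r][col]
            m[i] = [a - factor * b for a, b in zip(m[i], m[r])]
        r += 1
    return r

def one_hot(index, n_categories, drop_last=True):
    """Dense one-hot vector; the last category maps to all zeros if drop_last."""
    size = n_categories - 1 if drop_last else n_categories
    return [1 if i == index else 0 for i in range(size)]

indices = [0, 0, 1, 2]  # the c_idx column from the question

# Design matrices with a leading intercept column of ones.
full = [[1] + one_hot(i, 3, drop_last=False) for i in indices]
dropped = [[1] + one_hot(i, 3, drop_last=True) for i in indices]

full_rank = rank(full)        # 4 columns, rank 3: one column is redundant
dropped_rank = rank(dropped)  # 3 columns, rank 3: full column rank

print(len(full[0]), full_rank)        # 4 3
print(len(dropped[0]), dropped_rank)  # 3 3
```

With all categories kept, the matrix has four columns but only rank 3, so an unregularized linear model with an intercept has no unique set of coefficients. Dropping the last category removes the redundant column without losing information. Models that do not rely on a full-rank design matrix (tree-based models, for instance) are generally not affected in the same way, which is a case where setDropLast(False) may be reasonable.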


answered Oct 20 '22 01:10

Romain Jouin
