I have several categorical features and would like to transform them all using OneHotEncoder. However, when I try to apply the StringIndexer to all of them at once, I get an error:
stringIndexer = StringIndexer(
    inputCol=['a', 'b', 'c', 'd'],
    outputCol=['a_index', 'b_index', 'c_index', 'd_index']
)
model = stringIndexer.fit(Data)
An error occurred while calling o328.fit.
: java.lang.ClassCastException: java.util.ArrayList cannot be cast to java.lang.String
at org.apache.spark.ml.feature.StringIndexer.fit(StringIndexer.scala:79)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
Traceback (most recent call last):
Py4JJavaError: An error occurred while calling o328.fit.
A one-hot encoder maps a column of category indices to a column of binary vectors, with at most a single one-value per row that indicates the input category index. For example, with 5 categories, an input value of 2.0 maps to the output vector [0.0, 0.0, 1.0, 0.0]. The vector has only four entries because the last category is dropped by default (configurable via dropLast).
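To make the index-to-vector mapping concrete, here is a minimal pure-Python sketch. It is not Spark's implementation: the function name one_hot and its drop_last argument are hypothetical stand-ins, and a plain list replaces the sparse vector Spark actually produces. It assumes Spark's default behaviour of dropping the last category, so five categories yield a vector of length four.

```python
def one_hot(index, num_categories, drop_last=True):
    """Return a dense one-hot list for a category index (illustrative only)."""
    index = int(index)  # Spark stores indices as doubles such as 2.0
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    # With drop_last, the final category is encoded as the all-zeros vector.
    if index < size:
        vector[index] = 1.0
    return vector


print(one_hot(2.0, 5))                    # index 2 of 5 categories
print(one_hot(4.0, 5))                    # last category: all zeros
print(one_hot(4.0, 5, drop_last=False))   # keep every category
```

Dropping the last category keeps the encoded columns from summing to one in every row, at the cost of representing that category implicitly.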
One-Hot Encoding with Pandas
One-hot encoding can be implemented with pandas using the get_dummies function, which takes the following parameters:
data : array-like, Series, or DataFrame. The data containing categorical variables of which to get dummy indicators.
Spark >= 3.0:
In Spark 3.0, OneHotEncoderEstimator has been renamed to OneHotEncoder, so replace
from pyspark.ml.feature import OneHotEncoderEstimator, OneHotEncoderModel
encoder = OneHotEncoderEstimator(...)
with
from pyspark.ml.feature import OneHotEncoder, OneHotEncoderModel
encoder = OneHotEncoder(...)
Since Spark 3.0, StringIndexer also accepts multiple columns through inputCols and outputCols.
Spark >= 2.3:
You can use the newly added OneHotEncoderEstimator, which encodes several indexed columns in a single stage:
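from pyspark.ml import Pipeline
from pyspark.ml.feature import (
    OneHotEncoderEstimator, OneHotEncoderModel, StringIndexer, VectorAssembler)

# Indexing still needs one StringIndexer per column.
cols = ['a', 'b', 'c', 'd']
indexers = [
    StringIndexer(inputCol=c, outputCol="{0}_indexed".format(c))
    for c in cols
]
# A single estimator encodes all indexed columns at once.
encoder = OneHotEncoderEstimator(
    inputCols=[indexer.getOutputCol() for indexer in indexers],
    outputCols=[
        "{0}_encoded".format(indexer.getOutputCol()) for indexer in indexers]
)
assembler = VectorAssembler(
    inputCols=encoder.getOutputCols(),
    outputCol="features"
)
pipeline = Pipeline(stages=indexers + [encoder, assembler])
pipeline.fit(df).transform(df)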
Spark < 2.3:
It is not possible. StringIndexer operates on only a single column at a time, so you'll need a separate indexer and a separate encoder for each column you want to transform.
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

cols = ['a', 'b', 'c', 'd']
indexers = [
    StringIndexer(inputCol=c, outputCol="{0}_indexed".format(c))
    for c in cols
]
encoders = [
    OneHotEncoder(
        inputCol=indexer.getOutputCol(),
        outputCol="{0}_encoded".format(indexer.getOutputCol()))
    for indexer in indexers
]
assembler = VectorAssembler(
    inputCols=[encoder.getOutputCol() for encoder in encoders],
    outputCol="features"
)
pipeline = Pipeline(stages=indexers + encoders + [assembler])
pipeline.fit(df).transform(df).show()
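For readers who want to see what the per-column indexing stage does without starting Spark, the following pure-Python sketch mimics it. It is only an illustration: fit_string_indexer and index_columns are hypothetical helpers, not PySpark functions. It assumes StringIndexer's default ordering, where the most frequent label gets index 0.0, and breaks ties alphabetically, which is an assumption of this sketch rather than guaranteed behaviour in every Spark version.

```python
from collections import Counter


def fit_string_indexer(values):
    """Map each distinct label to an index, most frequent label first."""
    counts = Counter(values)
    # Descending frequency; alphabetical order makes ties deterministic here.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


def index_columns(rows, cols):
    """Fit one mapping per column and add a '<col>_indexed' value to each row."""
    mappings = {c: fit_string_indexer([row[c] for row in rows]) for c in cols}
    indexed = []
    for row in rows:
        new_row = dict(row)
        for c in cols:
            new_row["{0}_indexed".format(c)] = mappings[c][row[c]]
        indexed.append(new_row)
    return indexed


rows = [
    {"a": "x", "b": "u"},
    {"a": "y", "b": "u"},
    {"a": "x", "b": "v"},
]
result = index_columns(rows, ["a", "b"])
for row in result:
    print(row)
```

Each column gets its own independent mapping, which mirrors the list of one StringIndexer per column built above.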