Apply OneHotEncoder to several categorical columns in Spark MLlib

I have several categorical features and would like to transform them all using OneHotEncoder. However, when I pass lists of column names to StringIndexer, I get an error:

stringIndexer = StringIndexer(
    inputCol = ['a', 'b','c','d'],
    outputCol = ['a_index', 'b_index','c_index','d_index']
)  

model = stringIndexer.fit(Data)

Traceback (most recent call last):
Py4JJavaError: An error occurred while calling o328.fit.
: java.lang.ClassCastException: java.util.ArrayList cannot be cast to java.lang.String
    at org.apache.spark.ml.feature.StringIndexer.fit(StringIndexer.scala:79)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:207)
    at java.lang.Thread.run(Thread.java:745)

551

asked Mar 04 '16 19:03

MYjx

1 Answer

Spark >= 3.0

In Spark 3.0, OneHotEncoderEstimator was renamed to OneHotEncoder, so replace:

from pyspark.ml.feature import OneHotEncoderEstimator, OneHotEncoderModel

encoder = OneHotEncoderEstimator(...)

with

from pyspark.ml.feature import OneHotEncoder, OneHotEncoderModel

encoder = OneHotEncoder(...)
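
Since Spark 3.0, StringIndexer also accepts inputCols and outputCols, so a single indexer and a single encoder can cover every column. The following is a minimal sketch, assuming Spark 3.0 or later. The helper names derived_names and build_pipeline and the "_indexed" / "_encoded" suffixes are illustrative only and are not part of the Spark API:

```python
def derived_names(cols):
    """Return (indexed, encoded) output column names for the given input columns.

    The suffixes are an assumption chosen to match the naming used elsewhere
    in this answer; any unique names would work.
    """
    indexed = ["{0}_indexed".format(c) for c in cols]
    encoded = ["{0}_encoded".format(c) for c in indexed]
    return indexed, encoded


def build_pipeline(cols):
    """Build an unfitted Pipeline: index -> one-hot encode -> assemble.

    Assumes Spark >= 3.0 and an active SparkSession when it is called.
    """
    # Imported here so that derived_names stays usable without Spark installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

    cols = list(cols)
    indexed, encoded = derived_names(cols)

    # One multi-column indexer and one multi-column encoder replace the
    # per-column lists needed in older versions.
    indexer = StringIndexer(inputCols=cols, outputCols=indexed)
    encoder = OneHotEncoder(inputCols=indexed, outputCols=encoded)
    assembler = VectorAssembler(inputCols=encoded, outputCol="features")

    return Pipeline(stages=[indexer, encoder, assembler])
```

With an active SparkSession, build_pipeline(['a', 'b', 'c', 'd']).fit(df).transform(df) should then add a "features" column in the same way as the per-column pipelines for older versions.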

Spark >= 2.3

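You can use the newly added OneHotEncoderEstimator, which encodes several columns at once. StringIndexer still needs one instance per column:

from pyspark.ml import Pipeline
from pyspark.ml.feature import (
    OneHotEncoderEstimator, StringIndexer, VectorAssembler
)

cols = ['a', 'b', 'c', 'd']

# Before Spark 3.0, StringIndexer handles one column at a time
indexers = [
    StringIndexer(inputCol=c, outputCol="{0}_indexed".format(c))
    for c in cols
]

# A single estimator encodes all indexed columns
encoder = OneHotEncoderEstimator(
    inputCols=[indexer.getOutputCol() for indexer in indexers],
    outputCols=[
        "{0}_encoded".format(indexer.getOutputCol()) for indexer in indexers]
)

assembler = VectorAssembler(
    inputCols=encoder.getOutputCols(),
    outputCol="features"
)

pipeline = Pipeline(stages=indexers + [encoder, assembler])
pipeline.fit(df).transform(df)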

Spark < 2.3

This is not possible with a single transformer. StringIndexer operates on only one column at a time, so you need a separate indexer and a separate encoder for each column you want to transform.

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

cols = ['a', 'b', 'c', 'd']

indexers = [
    StringIndexer(inputCol=c, outputCol="{0}_indexed".format(c))
    for c in cols
]

encoders = [
    OneHotEncoder(
        inputCol=indexer.getOutputCol(),
        outputCol="{0}_encoded".format(indexer.getOutputCol())) 
    for indexer in indexers
]

assembler = VectorAssembler(
    inputCols=[encoder.getOutputCol() for encoder in encoders],
    outputCol="features"
)


pipeline = Pipeline(stages=indexers + encoders + [assembler])
pipeline.fit(df).transform(df).show()

169

answered Oct 12 '22 11:10

zero323

Tags: python, apache-spark, pyspark, apache-spark-ml, apache-spark-mllib
