When I use OneHotEncoder in Spark, I get the result shown in the fourth column, which is a sparse vector:
// +---+--------+-------------+-------------+
// | id|category|categoryIndex| categoryVec|
// +---+--------+-------------+-------------+
// | 0| a| 0.0|(3,[0],[1.0])|
// | 1| b| 2.0|(3,[2],[1.0])|
// | 2| c| 1.0|(3,[1],[1.0])|
// | 3| NA| 3.0| (3,[],[])|
// | 4| a| 0.0|(3,[0],[1.0])|
// | 5| c| 1.0|(3,[1],[1.0])|
// +---+--------+-------------+-------------+
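For reference, here is a minimal sketch (my own, assuming Spark 3.x and an existing SparkSession named spark; the question does not show its exact code) of the kind of StringIndexer + OneHotEncoder pipeline that produces output like the table above:

from pyspark.ml.feature import StringIndexer, OneHotEncoder

df = spark.createDataFrame(
    [(0, 'a'), (1, 'b'), (2, 'c'), (3, 'NA'), (4, 'a'), (5, 'c')],
    ['id', 'category'])

# index the string column, then one-hot encode the index into a sparse vector column
indexed = StringIndexer(inputCol='category', outputCol='categoryIndex').fit(df).transform(df)
encoded = OneHotEncoder(inputCols=['categoryIndex'], outputCols=['categoryVec']).fit(indexed).transform(indexed)
encoded.show()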
However, what I want is a separate column per category (three columns here), just like pandas get_dummies produces:
>>> import pandas as pd
>>> s = pd.Series(list('abca'))
>>> pd.get_dummies(s)
a b c
0 1 0 0
1 0 1 0
2 0 0 1
3 1 0 0
PySpark DataFrames do provide a toPandas() method to convert to a pandas DataFrame, but toPandas() collects all records of the PySpark DataFrame to the driver program, so it should only be used on a small subset of the data.
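So, for data that is small enough to collect, a rough sketch of that route (my own, assuming a PySpark DataFrame df with a string column 'category') would be:

import pandas as pd

pdf = df.toPandas()                        # collects every row to the driver
dummies = pd.get_dummies(pdf['category'])  # pandas-style 0/1 columns
result = pd.concat([pdf, dummies], axis=1)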
As for the sparse output itself: Spark's one-hot encoder maps a column of category indices to a column of binary vectors, with at most a single one-value per row indicating the input category index. For example, with 5 categories an input value of 2.0 maps to an output vector of [0.0, 0.0, 1.0, 0.0] (the last category is dropped by default, which is also why the NA row above comes out as (3,[],[])).
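To see what that sparse representation expands to, you can build the same vector by hand (my own illustration; with 5 categories and the last one dropped, the vectors have length 4):

from pyspark.ml.linalg import Vectors

v = Vectors.sparse(4, [2], [1.0])   # index 2.0 -> single 1.0 at position 2
print(v.toArray())                  # [0. 0. 1. 0.]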
Spark's OneHotEncoder creates a sparse vector column. To get output columns similar to pandas get_dummies, we need to create a separate column for each category. We can do that with the PySpark DataFrame's withColumn function, passing a UDF as a parameter. For example:
from pyspark.sql.functions import udf, col
from pyspark.sql.types import IntegerType

# sqlContext and sc are assumed to already exist (e.g. in the pyspark shell)
df = sqlContext.createDataFrame(
    sc.parallelize([(0, 'a'), (1, 'b'), (2, 'c'), (3, 'd')]),
    ('col1', 'col2'))

# collect the distinct categories to the driver, sorted for a stable column order
categories = df.select('col2').distinct().rdd.flatMap(lambda x: x).collect()
categories.sort()

# add one 0/1 column per category
for category in categories:
    # bind the current category as a default argument so each UDF keeps its own value
    function = udf(lambda item, category=category: 1 if item == category else 0, IntegerType())
    new_column_name = 'col2' + '_' + category
    df = df.withColumn(new_column_name, function(col('col2')))
df.show()
Output:
+----+----+------+------+------+------+
|col1|col2|col2_a|col2_b|col2_c|col2_d|
+----+----+------+------+------+------+
| 0| a| 1| 0| 0| 0|
| 1| b| 0| 1| 0| 0|
| 2| c| 0| 0| 1| 0|
| 3| d| 0| 0| 0| 1|
+----+----+------+------+------+------+
I hope this helps.