When I use OneHotEncoder in Spark, I get the result shown in the fourth column, which is a sparse vector:
// +---+--------+-------------+-------------+
// | id|category|categoryIndex| categoryVec|
// +---+--------+-------------+-------------+
// | 0| a| 0.0|(3,[0],[1.0])|
// | 1| b| 2.0|(3,[2],[1.0])|
// | 2| c| 1.0|(3,[1],[1.0])|
// | 3| NA| 3.0| (3,[],[])|
// | 4| a| 0.0|(3,[0],[1.0])|
// | 5| c| 1.0|(3,[1],[1.0])|
// +---+--------+-------------+-------------+
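For reference, here is a minimal sketch (my own, assuming Spark 3.x and an existing SparkSession named spark; the question does not show its exact code) of the kind of StringIndexer + OneHotEncoder pipeline that produces output like the table above:

from pyspark.ml.feature import StringIndexer, OneHotEncoder

df = spark.createDataFrame(
    [(0, 'a'), (1, 'b'), (2, 'c'), (3, 'NA'), (4, 'a'), (5, 'c')],
    ['id', 'category'])

# index the string column, then one-hot encode the index into a sparse vector column
indexed = StringIndexer(inputCol='category', outputCol='categoryIndex').fit(df).transform(df)
encoded = OneHotEncoder(inputCols=['categoryIndex'], outputCols=['categoryVec']).fit(indexed).transform(indexed)
encoded.show()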
However, what I want is a separate column per category (three columns here), just like pandas get_dummies produces:
>>> import pandas as pd
>>> s = pd.Series(list('abca'))
>>> pd.get_dummies(s)
a b c
0 1 0 0
1 0 1 0
2 0 0 1
3 1 0 0
PySpark DataFrames do provide a toPandas() method to convert to a pandas DataFrame, but toPandas() collects all records of the PySpark DataFrame to the driver program, so it should only be used on a small subset of the data.
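So, for data that is small enough to collect, a rough sketch of that route (my own, assuming a PySpark DataFrame df with a string column 'category') would be:

import pandas as pd

pdf = df.toPandas()                        # collects every row to the driver
dummies = pd.get_dummies(pdf['category'])  # pandas-style 0/1 columns
result = pd.concat([pdf, dummies], axis=1)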
As for the sparse output itself: Spark's one-hot encoder maps a column of category indices to a column of binary vectors, with at most a single one-value per row indicating the input category index. For example, with 5 categories an input value of 2.0 maps to an output vector of [0.0, 0.0, 1.0, 0.0] (the last category is dropped by default, which is also why the NA row above comes out as (3,[],[])).
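To see what that sparse representation expands to, you can build the same vector by hand (my own illustration; with 5 categories and the last one dropped, the vectors have length 4):

from pyspark.ml.linalg import Vectors

v = Vectors.sparse(4, [2], [1.0])   # index 2.0 -> single 1.0 at position 2
print(v.toArray())                  # [0. 0. 1. 0.]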
Spark's OneHotEncoder creates a sparse vector column. To get output columns similar to pandas get_dummies, we need to create a separate column for each category. We can do that with the PySpark DataFrame's withColumn function, passing a UDF as a parameter. For example:
from pyspark.sql.functions import udf, col
from pyspark.sql.types import IntegerType

# sqlContext and sc are assumed to already exist (e.g. in the pyspark shell)
df = sqlContext.createDataFrame(
    sc.parallelize([(0, 'a'), (1, 'b'), (2, 'c'), (3, 'd')]),
    ('col1', 'col2'))

# collect the distinct categories to the driver, sorted for a stable column order
categories = df.select('col2').distinct().rdd.flatMap(lambda x: x).collect()
categories.sort()

# add one 0/1 column per category
for category in categories:
    # bind the current category as a default argument so each UDF keeps its own value
    function = udf(lambda item, category=category: 1 if item == category else 0, IntegerType())
    new_column_name = 'col2' + '_' + category
    df = df.withColumn(new_column_name, function(col('col2')))
df.show()
Output:
+----+----+------+------+------+------+
|col1|col2|col2_a|col2_b|col2_c|col2_d|
+----+----+------+------+------+------+
| 0| a| 1| 0| 0| 0|
| 1| b| 0| 1| 0| 0|
| 2| c| 0| 0| 1| 0|
| 3| d| 0| 0| 0| 1|
+----+----+------+------+------+------+
I hope this helps.