I am hoping to dummy encode my categorical variables into numerical variables, as shown in the image below, using PySpark syntax.
I read in the data like this:
data = sqlContext.read.csv("data.txt", sep = ";", header = "true")
In Python (pandas) I am able to encode my variables using the code below:
data = pd.get_dummies(data, columns = ['Continent'])
However, I am not sure how to do this in PySpark.
Any assistance would be greatly appreciated.
Dummy encoding also uses binary (0/1) indicator variables, but instead of creating one indicator per category (k indicators for k categories), it uses k-1 indicators, dropping one category to serve as the baseline.
One-hot encoding is the process of creating those indicator variables without dropping any. It is used for nominal categorical features, i.e. where the categories have no inherent order: for every categorical feature, one new binary variable is created per category.
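For comparison, both variants can be produced in pandas with the same get_dummies call from the question (the Continent values below are made up for illustration): calling it plainly yields k indicator columns (one-hot), while drop_first=True yields k-1 columns (dummy encoding).

```python
import pandas as pd

df = pd.DataFrame({"Continent": ["Africa", "Asia", "Europe"]})

# One-hot: one indicator column per category (k columns)
onehot = pd.get_dummies(df, columns=["Continent"])

# Dummy encoding: drop the first category, keeping k-1 columns
dummy = pd.get_dummies(df, columns=["Continent"], drop_first=True)

print(list(onehot.columns))  # ['Continent_Africa', 'Continent_Asia', 'Continent_Europe']
print(list(dummy.columns))   # ['Continent_Asia', 'Continent_Europe']
```

In the dummy-encoded frame, a row of all zeros identifies the dropped baseline category (Africa here).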
Try this:
import pyspark.sql.functions as F

# Collect the distinct categories of the 'Continent' column
categ = df.select('Continent').distinct().rdd.flatMap(lambda x: x).collect()

# Build one 0/1 indicator column per category
exprs = [F.when(F.col('Continent') == cat, 1).otherwise(0).alias(str(cat))
         for cat in categ]

df = df.select(exprs + df.columns)
Exclude df.columns from the select if you do not want the original columns in your transformed dataframe.
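To see what that select produces, here is the same indicator logic sketched in plain Python, with no Spark required (the sample rows are made up for illustration):

```python
# Sample rows standing in for the 'Continent' column of the dataframe
rows = [{"Continent": "Asia"}, {"Continent": "Europe"}, {"Continent": "Asia"}]

# Distinct categories, mirroring df.select('Continent').distinct()
categ = sorted({r["Continent"] for r in rows})

# One 0/1 indicator per category per row, mirroring
# F.when(F.col('Continent') == cat, 1).otherwise(0)
encoded = [
    {**{cat: int(r["Continent"] == cat) for cat in categ},
     "Continent": r["Continent"]}
    for r in rows
]

print(encoded[0])  # {'Asia': 1, 'Europe': 0, 'Continent': 'Asia'}
```

Each row gets exactly one 1 across the indicator columns, which is what the list of when/otherwise expressions yields in Spark.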