I am hoping to dummy encode my categorical variables into numerical variables, as shown in the image below, using PySpark syntax.
I read in the data like this:
data = sqlContext.read.csv("data.txt", sep = ";", header = "true")
In Python (pandas) I am able to encode my variables using the code below:
data = pd.get_dummies(data, columns = ['Continent'])
However, I am not sure how to do this in PySpark.
Any assistance would be greatly appreciated.
Dummy encoding also uses binary (0/1) indicator variables, but instead of creating one indicator per category (k indicators for k categories), it uses k-1 indicators, dropping one category to serve as the baseline.
One-hot encoding is the process of creating those indicator variables without dropping any. It is used for nominal categorical features, i.e. where the categories have no inherent order: for every categorical feature, one new binary variable is created per category.
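For comparison, both variants can be produced in pandas with the same get_dummies call from the question (the Continent values below are made up for illustration): calling it plainly yields k indicator columns (one-hot), while drop_first=True yields k-1 columns (dummy encoding).

```python
import pandas as pd

df = pd.DataFrame({"Continent": ["Africa", "Asia", "Europe"]})

# One-hot: one indicator column per category (k columns)
onehot = pd.get_dummies(df, columns=["Continent"])

# Dummy encoding: drop the first category, keeping k-1 columns
dummy = pd.get_dummies(df, columns=["Continent"], drop_first=True)

print(list(onehot.columns))  # ['Continent_Africa', 'Continent_Asia', 'Continent_Europe']
print(list(dummy.columns))   # ['Continent_Asia', 'Continent_Europe']
```

In the dummy-encoded frame, a row of all zeros identifies the dropped baseline category (Africa here).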
Try this:
import pyspark.sql.functions as F

# Collect the distinct categories of the 'Continent' column
categ = df.select('Continent').distinct().rdd.flatMap(lambda x: x).collect()

# Build one 0/1 indicator column per category
exprs = [F.when(F.col('Continent') == cat, 1).otherwise(0).alias(str(cat))
         for cat in categ]

df = df.select(exprs + df.columns)
Exclude df.columns from the select if you do not want the original columns in your transformed dataframe.
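To see what that select produces, here is the same indicator logic sketched in plain Python, with no Spark required (the sample rows are made up for illustration):

```python
# Sample rows standing in for the 'Continent' column of the dataframe
rows = [{"Continent": "Asia"}, {"Continent": "Europe"}, {"Continent": "Asia"}]

# Distinct categories, mirroring df.select('Continent').distinct()
categ = sorted({r["Continent"] for r in rows})

# One 0/1 indicator per category per row, mirroring
# F.when(F.col('Continent') == cat, 1).otherwise(0)
encoded = [
    {**{cat: int(r["Continent"] == cat) for cat in categ},
     "Continent": r["Continent"]}
    for r in rows
]

print(encoded[0])  # {'Asia': 1, 'Europe': 0, 'Continent': 'Asia'}
```

Each row gets exactly one 1 across the indicator columns, which is what the list of when/otherwise expressions yields in Spark.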