 

Enum / get_dummies in PySpark

I would like to create a function in PySpark that takes a DataFrame and a list of parameters (codes/categorical features) and returns the DataFrame with additional dummy columns for the categories of each feature in the list (a before-and-after example DataFrame was attached to the original question).

The equivalent pandas code in Python looks like this:

import pandas as pd

enum = ['column1', 'column2']

for e in enum:
    print(e)
    # One-hot encode the column (dropping the first level) with the column name as prefix
    temp = pd.get_dummies(data[e], drop_first=True, prefix=e)
    # Append the dummy columns and drop the original categorical column
    data = pd.concat([data, temp], axis=1)
    data.drop(e, axis=1, inplace=True)

data.to_csv('enum_data.csv')
asked Mar 15 '17 by T.c

People also ask

How do you create a dummy column in PySpark?

In PySpark, to add a new constant column to a DataFrame, use the lit() function from pyspark.sql.functions. lit() takes a constant value and returns a Column; to add a NULL / None column, use lit(None).
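A minimal sketch of both uses (the DataFrame and column names here are illustrative, not from the question):

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "A"), (2, "B")], ["ID", "TYPE"])

df = df.withColumn("source", lit("manual"))           # constant string column
df = df.withColumn("flag", lit(None).cast("string"))  # NULL column with an explicit type
df.show()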

How do I get data size in PySpark?

Similar to pandas, you can get the size and shape of a PySpark (Spark with Python) DataFrame by running the count() action to get the number of rows and len(df.columns) to get the number of columns (columns is a property, not a method).
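For example, a pandas-style (rows, columns) shape can be assembled like this (a sketch, reusing a DataFrame named df):

rows = df.count()        # action: counts the rows across all partitions
cols = len(df.columns)   # df.columns is a plain Python list of column names
print((rows, cols))      # e.g. (8, 3) for the 8-row, 3-column DataFrame in the answer below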

What does .collect do in PySpark?

collect() is an action on an RDD or DataFrame that retrieves the data from the executors. It gathers all the rows from every partition and brings them back to the driver node/program, so it should only be used on data small enough to fit in driver memory.
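A short sketch of typical usage (again with illustrative column names):

rows = df.collect()                   # list of Row objects, now on the driver
for row in rows:
    print(row["ID"], row["TYPE"])     # fields are accessible by name or by index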


1 Answer

First you need to collect the distinct values of TYPE and CODE. Then either add a column for each value using withColumn, or build all the dummy columns in a single select. Here is sample code using the select approach:

import pyspark.sql.functions as F

df = sqlContext.createDataFrame([
    (1, "A", "X1"),
    (2, "B", "X2"),
    (3, "B", "X3"),
    (1, "B", "X3"),
    (2, "C", "X2"),
    (3, "C", "X2"),
    (1, "C", "X1"),
    (1, "B", "X1"),
], ["ID", "TYPE", "CODE"])

# Collect the distinct categories of each column to the driver
types = df.select("TYPE").distinct().rdd.flatMap(lambda x: x).collect()
codes = df.select("CODE").distinct().rdd.flatMap(lambda x: x).collect()

# Build one 0/1 expression per category, named e_<column>_<value>
types_expr = [F.when(F.col("TYPE") == ty, 1).otherwise(0).alias("e_TYPE_" + ty) for ty in types]
codes_expr = [F.when(F.col("CODE") == code, 1).otherwise(0).alias("e_CODE_" + code) for code in codes]

# Select the original columns plus all dummy expressions in one pass
df = df.select("ID", "TYPE", "CODE", *(types_expr + codes_expr))
df.show()

OUTPUT

+---+----+----+--------+--------+--------+---------+---------+---------+
| ID|TYPE|CODE|e_TYPE_A|e_TYPE_B|e_TYPE_C|e_CODE_X1|e_CODE_X2|e_CODE_X3|
+---+----+----+--------+--------+--------+---------+---------+---------+
|  1|   A|  X1|       1|       0|       0|        1|        0|        0|
|  2|   B|  X2|       0|       1|       0|        0|        1|        0|
|  3|   B|  X3|       0|       1|       0|        0|        0|        1|
|  1|   B|  X3|       0|       1|       0|        0|        0|        1|
|  2|   C|  X2|       0|       0|       1|        0|        1|        0|
|  3|   C|  X2|       0|       0|       1|        0|        1|        0|
|  1|   C|  X1|       0|       0|       1|        1|        0|        0|
|  1|   B|  X1|       0|       1|       0|        1|        0|        0|
+---+----+----+--------+--------+--------+---------+---------+---------+
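To wrap this into the reusable function the question asked for, here is a sketch that generalizes the same select-based technique to any list of categorical columns (the function name get_dummies_spark, the e_ prefix, and the drop_original flag are illustrative choices, not part of the original answer):

import pyspark.sql.functions as F

def get_dummies_spark(df, columns, drop_original=True):
    """Append 0/1 dummy columns for each distinct value of the given columns,
    mirroring pd.get_dummies with prefix=column name."""
    exprs = []
    for col in columns:
        # One distinct() pass per column to collect its categories on the driver
        values = [row[0] for row in df.select(col).distinct().collect()]
        exprs += [
            F.when(F.col(col) == v, 1).otherwise(0).alias("e_{}_{}".format(col, v))
            for v in values
        ]
    # Optionally drop the original categorical columns, as pd.get_dummies does
    keep = [c for c in df.columns if not drop_original or c not in columns]
    return df.select(*(keep + exprs))

# Usage with the example DataFrame above:
# get_dummies_spark(df, ["TYPE", "CODE"], drop_original=False).show()

Note that null categories would need special handling (an equality comparison with null is never true in Spark SQL), and that pandas' drop_first=True behavior is not replicated here.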
answered Oct 09 '22 by Rakesh Kumar