 

Enum / get_dummies in PySpark

I would like to create a function in PySpark that takes a DataFrame and a list of parameters (codes/categorical features) and returns the DataFrame with additional dummy columns for the categories of each feature in the list (a before-and-after example DataFrame was attached to the original question).

The equivalent pandas code in Python looks like this:

import pandas as pd

enum = ['column1', 'column2']

for e in enum:
    print(e)
    # One-hot encode the column (dropping the first level) with the column name as prefix
    temp = pd.get_dummies(data[e], drop_first=True, prefix=e)
    # Append the dummy columns and drop the original categorical column
    data = pd.concat([data, temp], axis=1)
    data.drop(e, axis=1, inplace=True)

data.to_csv('enum_data.csv')
asked Mar 15 '17 by T.c

People also ask

How do you create a dummy column in PySpark?

In PySpark, to add a new constant column to a DataFrame, use the lit() function from pyspark.sql.functions. lit() takes a constant value and returns a Column; to add a NULL / None column, use lit(None).
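A minimal sketch of both uses (the DataFrame and column names here are illustrative, not from the question):

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "A"), (2, "B")], ["ID", "TYPE"])

df = df.withColumn("source", lit("manual"))           # constant string column
df = df.withColumn("flag", lit(None).cast("string"))  # NULL column with an explicit type
df.show()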

How do I get data size in PySpark?

Similar to pandas, you can get the size and shape of a PySpark (Spark with Python) DataFrame by running the count() action to get the number of rows and len(df.columns) to get the number of columns (columns is a property, not a method).
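For example, a pandas-style (rows, columns) shape can be assembled like this (a sketch, reusing a DataFrame named df):

rows = df.count()        # action: counts the rows across all partitions
cols = len(df.columns)   # df.columns is a plain Python list of column names
print((rows, cols))      # e.g. (8, 3) for the 8-row, 3-column DataFrame in the answer below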

What does .collect do in PySpark?

collect() is an action on an RDD or DataFrame that retrieves the data from the executors. It gathers all the rows from every partition and brings them back to the driver node/program, so it should only be used on data small enough to fit in driver memory.
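A short sketch of typical usage (again with illustrative column names):

rows = df.collect()                   # list of Row objects, now on the driver
for row in rows:
    print(row["ID"], row["TYPE"])     # fields are accessible by name or by index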


1 Answer

First you need to collect the distinct values of TYPE and CODE. Then either add a column for each value using withColumn, or build all the dummy columns in a single select. Here is sample code using the select approach:

import pyspark.sql.functions as F

df = sqlContext.createDataFrame([
    (1, "A", "X1"),
    (2, "B", "X2"),
    (3, "B", "X3"),
    (1, "B", "X3"),
    (2, "C", "X2"),
    (3, "C", "X2"),
    (1, "C", "X1"),
    (1, "B", "X1"),
], ["ID", "TYPE", "CODE"])

# Collect the distinct categories of each column to the driver
types = df.select("TYPE").distinct().rdd.flatMap(lambda x: x).collect()
codes = df.select("CODE").distinct().rdd.flatMap(lambda x: x).collect()

# Build one 0/1 expression per category, named e_<column>_<value>
types_expr = [F.when(F.col("TYPE") == ty, 1).otherwise(0).alias("e_TYPE_" + ty) for ty in types]
codes_expr = [F.when(F.col("CODE") == code, 1).otherwise(0).alias("e_CODE_" + code) for code in codes]

# Select the original columns plus all dummy expressions in one pass
df = df.select("ID", "TYPE", "CODE", *(types_expr + codes_expr))
df.show()

OUTPUT

+---+----+----+--------+--------+--------+---------+---------+---------+
| ID|TYPE|CODE|e_TYPE_A|e_TYPE_B|e_TYPE_C|e_CODE_X1|e_CODE_X2|e_CODE_X3|
+---+----+----+--------+--------+--------+---------+---------+---------+
|  1|   A|  X1|       1|       0|       0|        1|        0|        0|
|  2|   B|  X2|       0|       1|       0|        0|        1|        0|
|  3|   B|  X3|       0|       1|       0|        0|        0|        1|
|  1|   B|  X3|       0|       1|       0|        0|        0|        1|
|  2|   C|  X2|       0|       0|       1|        0|        1|        0|
|  3|   C|  X2|       0|       0|       1|        0|        1|        0|
|  1|   C|  X1|       0|       0|       1|        1|        0|        0|
|  1|   B|  X1|       0|       1|       0|        1|        0|        0|
+---+----+----+--------+--------+--------+---------+---------+---------+
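To wrap this into the reusable function the question asked for, here is a sketch that generalizes the same select-based technique to any list of categorical columns (the function name get_dummies_spark, the e_ prefix, and the drop_original flag are illustrative choices, not part of the original answer):

import pyspark.sql.functions as F

def get_dummies_spark(df, columns, drop_original=True):
    """Append 0/1 dummy columns for each distinct value of the given columns,
    mirroring pd.get_dummies with prefix=column name."""
    exprs = []
    for col in columns:
        # One distinct() pass per column to collect its categories on the driver
        values = [row[0] for row in df.select(col).distinct().collect()]
        exprs += [
            F.when(F.col(col) == v, 1).otherwise(0).alias("e_{}_{}".format(col, v))
            for v in values
        ]
    # Optionally drop the original categorical columns, as pd.get_dummies does
    keep = [c for c in df.columns if not drop_original or c not in columns]
    return df.select(*(keep + exprs))

# Usage with the example DataFrame above:
# get_dummies_spark(df, ["TYPE", "CODE"], drop_original=False).show()

Note that null categories would need special handling (an equality comparison with null is never true in Spark SQL), and that pandas' drop_first=True behavior is not replicated here.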
answered Oct 09 '22 by Rakesh Kumar