
Convert PySpark dataframe column type to string and replace the square brackets

I need to convert a PySpark dataframe column from array type to string and remove the square brackets. This is the schema of the dataframe. The columns that need to be processed are CurrencyCode and TicketAmount.

>>> plan_queryDF.printSchema()
root
 |-- event_type: string (nullable = true)
 |-- publishedDate: string (nullable = true)
 |-- plannedCustomerChoiceID: string (nullable = true)
 |-- assortedCustomerChoiceID: string (nullable = true)
 |-- CurrencyCode: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- TicketAmount: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- currentPlan: boolean (nullable = true)
 |-- originalPlan: boolean (nullable = true)
 |-- globalId: string (nullable = true)
 |-- PlanJsonData: string (nullable = true)

Sample data from the dataframe:

+--------------------+--------------------+-----------------------+------------------------+------------+------------+-----------+------------+------------+--------------------+
|          event_type|       publishedDate|plannedCustomerChoiceID|assortedCustomerChoiceID|CurrencyCode|TicketAmount|currentPlan|originalPlan|    globalId|        PlanJsonData|
+--------------------+--------------------+-----------------------+------------------------+------------+------------+-----------+------------+------------+--------------------+
|PlannedCustomerCh...|2016-08-23T04:46:...|   087d1ff1-5f3a-496...|    2539cc4a-37e5-4f3...|       [GBP]|         [0]|      false|       false|000576015000|{"httpStatus":200...|
|PlannedCustomerCh...|2016-08-23T04:30:...|   0a1af217-d1e8-4ab...|    61bc5fda-0160-484...|       [CNY]|       [329]|       true|       false|000189668017|{"httpStatus":200...|
|PlannedCustomerCh...|2016-08-23T05:49:...|   1028b477-f93e-47f...|    c6d5b761-94f2-454...|       [JPY]|      [3400]|       true|       false|000576058003|{"httpStatus":200...|

How can I do this? Currently I cast the column to string and then strip the square brackets with regexp_replace, but this approach fails when I process a large amount of data.

Is there another way to do it?

This is what I want.

+--------------------+--------------------+-----------------------+------------------------+------------+------------+-----------+------------+------------+--------------------+
|          event_type|       publishedDate|plannedCustomerChoiceID|assortedCustomerChoiceID|CurrencyCode|TicketAmount|currentPlan|originalPlan|    globalId|        PlanJsonData|
+--------------------+--------------------+-----------------------+------------------------+------------+------------+-----------+------------+------------+--------------------+
|PlannedCustomerCh...|2016-08-23T04:46:...|   087d1ff1-5f3a-496...|    2539cc4a-37e5-4f3...|         GBP|           0|      false|       false|000576015000|{"httpStatus":200...|
|PlannedCustomerCh...|2016-08-23T04:30:...|   0a1af217-d1e8-4ab...|    61bc5fda-0160-484...|         CNY|         329|       true|       false|000189668017|{"httpStatus":200...|
|PlannedCustomerCh...|2016-08-23T05:49:...|   1028b477-f93e-47f...|    c6d5b761-94f2-454...|         JPY|        3400|       true|       false|000576058003|{"httpStatus":200...|
asked Dec 16 '16 12:12 by ben


1 Answer

Since each array holds a single value, you can take its first element with getItem(0):

# Replace each single-element array with its first (only) element.
df = df \
    .withColumn("CurrencyCode", df["CurrencyCode"].getItem(0).cast("string")) \
    .withColumn("TicketAmount", df["TicketAmount"].getItem(0).cast("string"))

The final cast to string is optional, because the schema shows that the array elements are already strings.
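If more columns need the same treatment, the idea can be wrapped in a small helper. This is only a sketch: the function name unwrap_single_element_arrays is made up for illustration, and it assumes every array holds exactly one element, as in the sample rows.

```python
def unwrap_single_element_arrays(df, column_names):
    """Return df with each named array column replaced by its first element.

    Hypothetical helper, not part of PySpark. Assumes `df` is a PySpark
    DataFrame and that each listed column is an array with one element.
    """
    for name in column_names:
        # getItem(0) selects the element at index 0 of the array column;
        # withColumn with an existing name overwrites that column.
        df = df.withColumn(name, df[name].getItem(0))
    return df


# Usage with the dataframe from the question:
# plan_queryDF = unwrap_single_element_arrays(
#     plan_queryDF, ["CurrencyCode", "TicketAmount"]
# )
```

If an array could hold more than one value, pyspark.sql.functions.concat_ws(",", column) joins the elements into one string with a separator, which also avoids the brackets.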

answered Oct 05 '22 19:10 by Daniel de Paula