I need to convert a PySpark df column type from array to string and also remove the square brackets. This is the schema for the dataframe. columns that needs to be processed is CurrencyCode and TicketAmount
>>> plan_queryDF.printSchema()
root
|-- event_type: string (nullable = true)
|-- publishedDate: string (nullable = true)
|-- plannedCustomerChoiceID: string (nullable = true)
|-- assortedCustomerChoiceID: string (nullable = true)
|-- CurrencyCode: array (nullable = true)
| |-- element: string (containsNull = true)
|-- TicketAmount: array (nullable = true)
| |-- element: string (containsNull = true)
|-- currentPlan: boolean (nullable = true)
|-- originalPlan: boolean (nullable = true)
|-- globalId: string (nullable = true)
|-- PlanJsonData: string (nullable = true)
sample data from dataframe
+--------------------+--------------------+-----------------------+------------------------+------------+------------+-----------+------------+------------+--------------------+
| event_type| publishedDate|plannedCustomerChoiceID|assortedCustomerChoiceID|CurrencyCode|TicketAmount|currentPlan|originalPlan| globalId| PlanJsonData|
+--------------------+--------------------+-----------------------+------------------------+------------+------------+-----------+------------+------------+--------------------+
|PlannedCustomerCh...|2016-08-23T04:46:...| 087d1ff1-5f3a-496...| 2539cc4a-37e5-4f3...| [GBP]| [0]| false| false|000576015000|{"httpStatus":200...|
|PlannedCustomerCh...|2016-08-23T04:30:...| 0a1af217-d1e8-4ab...| 61bc5fda-0160-484...| [CNY]| [329]| true| false|000189668017|{"httpStatus":200...|
|PlannedCustomerCh...|2016-08-23T05:49:...| 1028b477-f93e-47f...| c6d5b761-94f2-454...| [JPY]| [3400]| true| false|000576058003|{"httpStatus":200...|
how can I do it? Currently I am doing a cast to string and then replacing the square braces with regexp_replace. but this approach fails when I process huge amount of data.
Is there any other way I can do it?
This is what I want.
+--------------------+--------------------+-----------------------+------------------------+------------+------------+-----------+------------+------------+--------------------+
| event_type| publishedDate|plannedCustomerChoiceID|assortedCustomerChoiceID|CurrencyCode|TicketAmount|currentPlan|originalPlan| globalId| PlanJsonData|
+--------------------+--------------------+-----------------------+------------------------+------------+------------+-----------+------------+------------+--------------------+
|PlannedCustomerCh...|2016-08-23T04:46:...| 087d1ff1-5f3a-496...| 2539cc4a-37e5-4f3...| GBP| 0| false| false|000576015000|{"httpStatus":200...|
|PlannedCustomerCh...|2016-08-23T04:30:...| 0a1af217-d1e8-4ab...| 61bc5fda-0160-484...| CNY| 329| true| false|000189668017|{"httpStatus":200...|
|PlannedCustomerCh...|2016-08-23T05:49:...| 1028b477-f93e-47f...| c6d5b761-94f2-454...| JPY| 3400| true| false|000576058003|{"httpStatus":200...|
You can try getItem(0)
:
df \
.withColumn("CurrencyCode", df["CurrencyCode"].getItem(0).cast("string")) \
.withColumn("TicketAmount", df["TicketAmount"].getItem(0).cast("string"))
The final cast to string is optional.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With