I have a PySpark DataFrame with a column named Filters of type "array<struct<Op:string,Type:string,Val:string>>".
I want to save my DataFrame to a CSV file, and for that I need to cast the array to string type.
I tried casting it with DF.Filters.tostring() and with DF.Filters.cast(StringType()), but both produce the following for each row in the Filters column:
org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@56234c19
The code is as follows:
from pyspark.sql.types import StringType
DF.printSchema()
|-- ClientNum: string (nullable = true)
|-- Filters: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- Op: string (nullable = true)
|    |    |-- Type: string (nullable = true)
|    |    |-- Val: string (nullable = true)
DF_cast = DF.select('ClientNum', DF.Filters.cast(StringType()))
DF_cast.printSchema()
|-- ClientNum: string (nullable = true)
|-- Filters: string (nullable = true)
DF_cast.show()
+---------+------------------------------------------------------------------+
|ClientNum|Filters                                                           |
+---------+------------------------------------------------------------------+
|32103    |org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@d9e517ce|
|218056   |org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@3c744494|
+---------+------------------------------------------------------------------+
Sample JSON data:
{"ClientNum":"abc123","Filters":[{"Op":"foo","Type":"bar","Val":"baz"}]}
Thanks !!
I created a sample JSON dataset to match that schema:
{"ClientNum":"abc123","Filters":[{"Op":"foo","Type":"bar","Val":"baz"}]}
s.select(F.col("ClientNum"), F.col("Filters").cast(StringType())).show(truncate=False)
+---------+------------------------------------------------------------------+
|ClientNum|Filters |
+---------+------------------------------------------------------------------+
|abc123 |org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@60fca57e|
+---------+------------------------------------------------------------------+
Your problem is best solved using the explode() function, which flattens an array, followed by the star-expand notation:
s.selectExpr("explode(Filters) AS structCol").selectExpr("structCol.*").show()
+---+----+---+
| Op|Type|Val|
+---+----+---+
|foo| bar|baz|
+---+----+---+
To make it a single comma-separated string column:
s.selectExpr("explode(Filters) AS structCol").select(F.expr("concat_ws(',', structCol.*)").alias("single_col")).show()
+-----------+
| single_col|
+-----------+
|foo,bar,baz|
+-----------+
Explode Array reference: Flattening Rows in Spark
Star expand reference for "struct" type: How to flatten a struct in a spark dataframe?
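Since the original goal was saving to CSV, the flattened, string-typed result can be written out directly. A minimal sketch that also keeps ClientNum (the output path and options are just examples):

flat = s.selectExpr("ClientNum", "explode(Filters) AS structCol") \
        .select("ClientNum", F.expr("concat_ws(',', structCol.*)").alias("Filters"))
flat.write.mode("overwrite").option("header", True).csv("/tmp/filters_csv")  # example path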
For me, in PySpark, the to_json() function did the job.
As a plus compared to simply casting to String, it keeps the struct keys as well (not only the struct values). So for the reported example I get something like:
[{"Op":"foo","Type":"bar","Val":"baz"}]
This was much more useful to me since I had to write the results to a Postgres table; in this format I can easily use the supported JSON functions in Postgres.
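A minimal sketch of that approach, reusing the DF name from the question (the commented output assumes the sample row above):

from pyspark.sql import functions as F

# Serialize the array<struct> column to a JSON string; the struct keys are preserved
DF_json = DF.withColumn("Filters", F.to_json(F.col("Filters")))
DF_json.show(truncate=False)
# Filters is now the string: [{"Op":"foo","Type":"bar","Val":"baz"}]
# which can be written to CSV or loaded into a Postgres json/jsonb column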