 

create empty array-column of given schema in Spark

Due to the fact that Parquet cannot persist empty arrays, I replaced empty arrays with null before writing a table. Now, when reading the table back, I want to do the opposite:

I have a DataFrame with the following schema:

|-- id: long (nullable = false)
 |-- arr: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- x: double (nullable = true)
 |    |    |-- y: double (nullable = true)

and the following content:

+---+-----------+
| id|        arr|
+---+-----------+
|  1|[[1.0,2.0]]|
|  2|       null|
+---+-----------+

I'd like to replace the null-array (id=2) with an empty array, i.e.

+---+-----------+
| id|        arr|
+---+-----------+
|  1|[[1.0,2.0]]|
|  2|         []|
+---+-----------+

I've tried:

val arrSchema = df.schema(1).dataType

df
.withColumn("arr",when($"arr".isNull,array().cast(arrSchema)).otherwise($"arr"))
.show()

which gives:

java.lang.ClassCastException: org.apache.spark.sql.types.NullType$ cannot be cast to org.apache.spark.sql.types.StructType

Edit: I don't want to "hardcode" any schema for my array column (at least not the schema of the struct), because it can vary from case to case. I can only use the schema information from df at runtime.

By the way, I'm using Spark 2.1, so I cannot use typedLit.

Asked Jun 27 '18 by Raphael Roth

People also ask

How do I create a blank DataFrame with schema in Spark Scala?

To create an empty PySpark DataFrame manually with a schema (column names & data types), first create the schema using StructType and StructField. Then pass an empty RDD, together with that schema, to createDataFrame() on the SparkSession.
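In Scala (the language of the question above) the same recipe looks roughly like this; `spark` is assumed to be an existing SparkSession:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// 1. Build the schema: column names and data types
val schema = StructType(Seq(
  StructField("id", LongType, nullable = false),
  StructField("name", StringType, nullable = true)
))

// 2. Pair an empty RDD[Row] with the schema
val emptyDF = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)

emptyDF.printSchema() // two columns, zero rows
```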

How do I add a blank column to a DataFrame Spark?

In PySpark, to add a new column to a DataFrame, use the lit() function: import it with from pyspark.sql.functions import lit. lit() takes the constant value you want to add and returns a Column; to add a NULL / None column, use lit(None).
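A hedged Scala sketch of the same idea (`df` is assumed to be an existing DataFrame; the column names are made up for illustration):

```scala
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.StringType

val withCols = df
  .withColumn("source", lit("batch"))                 // constant column
  .withColumn("comment", lit(null).cast(StringType))  // typed NULL column
```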

How do you create an array in PySpark?

Create PySpark ArrayType: you can create an instance of ArrayType using the ArrayType() class. It takes an elementType and one optional argument, containsNull, which specifies whether values may be null (True by default). elementType should be a PySpark type that extends the DataType class.
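The Scala equivalent, for comparison:

```scala
import org.apache.spark.sql.types.{ArrayType, DoubleType}

// element type plus an optional containsNull flag (defaults to true)
val arrType       = ArrayType(DoubleType)                        // array<double>, nulls allowed
val strictArrType = ArrayType(DoubleType, containsNull = false)  // nulls forbidden
```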

How do I get the schema of a DataFrame in Spark?

To get the schema of a Spark DataFrame, call printSchema() on the DataFrame object: it prints the schema to the console (stdout), while show() displays the DataFrame's content.
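For this question the programmatic form matters most, since the array type has to be read from df at runtime (sketch; `df` as in the question):

```scala
// Print the schema tree to stdout
df.printSchema()

// Or access it programmatically as a StructType
val schema  = df.schema
val arrType = df.schema("arr").dataType // runtime DataType of the "arr" column
```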


2 Answers

  • Spark 2.2+ with known external type

    In general you can use typedLit to provide empty arrays.

    import org.apache.spark.sql.functions.typedLit
    
    typedLit(Seq.empty[(Double, Double)])
    

    To use specific names for nested objects you can use case classes:

    case class Item(x: Double, y: Double)
    
    typedLit(Seq.empty[Item])
    

    or rename by cast:

    typedLit(Seq.empty[(Double, Double)])
      .cast("array<struct<x: Double, y: Double>>")
    
  • Spark 2.1+ with schema only

    With only the schema available, you can try:

    import org.apache.spark.sql.functions.{from_json, lit}
    import org.apache.spark.sql.types.{ArrayType, DoubleType, StructField, StructType}
    
    val schema = StructType(Seq(
      StructField("arr", ArrayType(StructType(Seq(
        StructField("x", DoubleType),
        StructField("y", DoubleType)
      ))))
    ))
    
    def arrayOfSchema(schema: StructType) =
      from_json(lit("""{"arr": []}"""), schema)("arr")
    
    arrayOfSchema(schema).alias("arr")
    

    where schema can be extracted from the existing DataFrame and wrapped with additional StructType:

    StructType(Seq(
      StructField("arr", df.schema("arr").dataType)
    ))
    
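Whichever variant is used, the resulting empty-array column can be combined with the original column via coalesce. A minimal end-to-end sketch (assuming `df` is the DataFrame from the question and `spark` an active SparkSession):

```scala
import org.apache.spark.sql.functions.{coalesce, col, from_json, lit}
import org.apache.spark.sql.types.{StructField, StructType}

// Wrap the runtime array type of "arr" in a single-field struct for from_json
val wrapped = StructType(Seq(StructField("arr", df.schema("arr").dataType)))

// coalesce keeps the first non-null value: the original array, else the empty one
val fixed = df.withColumn(
  "arr",
  coalesce(col("arr"), from_json(lit("""{"arr": []}"""), wrapped)("arr"))
)
```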
Answered Sep 18 '22 by Alper t. Turker


One way is to use a UDF:

val arrSchema = df.schema(1).dataType // ArrayType(StructType(StructField(x,DoubleType,true), StructField(y,DoubleType,true)),true)

val emptyArr = udf(() => Seq.empty[Any], arrSchema) // untyped UDF with an explicit return type

df
.withColumn("arr",when($"arr".isNull,emptyArr()).otherwise($"arr"))
.show()

+---+-----------+
| id|        arr|
+---+-----------+
|  1|[[1.0,2.0]]|
|  2|         []|
+---+-----------+
Answered Sep 20 '22 by Raphael Roth