 

create empty array-column of given schema in Spark

Due to the fact that Parquet cannot persist empty arrays, I replaced empty arrays with null before writing a table. Now, when reading the table back, I want to do the opposite:

I have a DataFrame with the following schema:

|-- id: long (nullable = false)
 |-- arr: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- x: double (nullable = true)
 |    |    |-- y: double (nullable = true)

and the following content:

+---+-----------+
| id|        arr|
+---+-----------+
|  1|[[1.0,2.0]]|
|  2|       null|
+---+-----------+

I'd like to replace the null-array (id=2) with an empty array, i.e.

+---+-----------+
| id|        arr|
+---+-----------+
|  1|[[1.0,2.0]]|
|  2|         []|
+---+-----------+

I've tried:

val arrSchema = df.schema(1).dataType

df
.withColumn("arr",when($"arr".isNull,array().cast(arrSchema)).otherwise($"arr"))
.show()

which gives:

java.lang.ClassCastException: org.apache.spark.sql.types.NullType$ cannot be cast to org.apache.spark.sql.types.StructType

Edit: I don't want to "hardcode" any schema for my array column (at least not the schema of the struct), because it can vary from case to case. I can only use the schema information from df at runtime.

By the way, I'm using Spark 2.1, so I cannot use typedLit.

Asked Jun 27 '18 by Raphael Roth

People also ask

How do I create a blank DataFrame with schema in Spark Scala?

To create an empty PySpark DataFrame manually with a schema (column names & data types), first create the schema using StructType and StructField. Then pass an empty RDD, together with that schema, to createDataFrame() on the SparkSession.
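In Scala (the language of the question above) the same recipe looks roughly like this; `spark` is assumed to be an existing SparkSession:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// 1. Build the schema: column names and data types
val schema = StructType(Seq(
  StructField("id", LongType, nullable = false),
  StructField("name", StringType, nullable = true)
))

// 2. Pair an empty RDD[Row] with the schema
val emptyDF = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)

emptyDF.printSchema() // two columns, zero rows
```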

How do I add a blank column to a DataFrame Spark?

In PySpark, to add a new column to a DataFrame, use the lit() function: import it with from pyspark.sql.functions import lit. lit() takes the constant value you want to add and returns a Column; to add a NULL / None column, use lit(None).
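A hedged Scala sketch of the same idea (`df` is assumed to be an existing DataFrame; the column names are made up for illustration):

```scala
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.StringType

val withCols = df
  .withColumn("source", lit("batch"))                 // constant column
  .withColumn("comment", lit(null).cast(StringType))  // typed NULL column
```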

How do you create an array in PySpark?

Create PySpark ArrayType: you can create an instance of ArrayType using the ArrayType() class. It takes an elementType and one optional argument, containsNull, which specifies whether values may be null (True by default). elementType should be a PySpark type that extends the DataType class.
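The Scala equivalent, for comparison:

```scala
import org.apache.spark.sql.types.{ArrayType, DoubleType}

// element type plus an optional containsNull flag (defaults to true)
val arrType       = ArrayType(DoubleType)                        // array<double>, nulls allowed
val strictArrType = ArrayType(DoubleType, containsNull = false)  // nulls forbidden
```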

How do I get the schema of a DataFrame in Spark?

To get the schema of a Spark DataFrame, call printSchema() on the DataFrame object: it prints the schema to the console (stdout), while show() displays the DataFrame's content.
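For this question the programmatic form matters most, since the array type has to be read from df at runtime (sketch; `df` as in the question):

```scala
// Print the schema tree to stdout
df.printSchema()

// Or access it programmatically as a StructType
val schema  = df.schema
val arrType = df.schema("arr").dataType // runtime DataType of the "arr" column
```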


2 Answers

  • Spark 2.2+ with known external type

    In general you can use typedLit to provide empty arrays.

    import org.apache.spark.sql.functions.typedLit
    
    typedLit(Seq.empty[(Double, Double)])
    

    To use specific names for nested objects you can use case classes:

    case class Item(x: Double, y: Double)
    
    typedLit(Seq.empty[Item])
    

    or rename by cast:

    typedLit(Seq.empty[(Double, Double)])
      .cast("array<struct<x: Double, y: Double>>")
    
  • Spark 2.1+ with schema only

    With only the schema available, you can try:

    import org.apache.spark.sql.functions.{from_json, lit}
    import org.apache.spark.sql.types.{ArrayType, DoubleType, StructField, StructType}
    
    val schema = StructType(Seq(
      StructField("arr", ArrayType(StructType(Seq(
        StructField("x", DoubleType),
        StructField("y", DoubleType)
      ))))
    ))
    
    def arrayOfSchema(schema: StructType) =
      from_json(lit("""{"arr": []}"""), schema)("arr")
    
    arrayOfSchema(schema).alias("arr")
    

    where schema can be extracted from the existing DataFrame and wrapped with additional StructType:

    StructType(Seq(
      StructField("arr", df.schema("arr").dataType)
    ))
    
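Whichever variant is used, the resulting empty-array column can be combined with the original column via coalesce. A minimal end-to-end sketch (assuming `df` is the DataFrame from the question and `spark` an active SparkSession):

```scala
import org.apache.spark.sql.functions.{coalesce, col, from_json, lit}
import org.apache.spark.sql.types.{StructField, StructType}

// Wrap the runtime array type of "arr" in a single-field struct for from_json
val wrapped = StructType(Seq(StructField("arr", df.schema("arr").dataType)))

// coalesce keeps the first non-null value: the original array, else the empty one
val fixed = df.withColumn(
  "arr",
  coalesce(col("arr"), from_json(lit("""{"arr": []}"""), wrapped)("arr"))
)
```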
Answered Sep 18 '22 by Alper t. Turker


One way is to use a UDF:

val arrSchema = df.schema(1).dataType // ArrayType(StructType(StructField(x,DoubleType,true), StructField(y,DoubleType,true)),true)

val emptyArr = udf(() => Seq.empty[Any], arrSchema) // untyped UDF with an explicit return type

df
.withColumn("arr",when($"arr".isNull,emptyArr()).otherwise($"arr"))
.show()

+---+-----------+
| id|        arr|
+---+-----------+
|  1|[[1.0,2.0]]|
|  2|         []|
+---+-----------+
Answered Sep 20 '22 by Raphael Roth