I want to create a DataFrame with a specified schema in Scala. I have tried using a JSON read (reading an empty file), but I don't think that's the best practice.
We can create a DataFrame programmatically in three steps:
1. Create an RDD of Rows from the original RDD.
2. Create the schema, represented by a StructType, matching the structure of the Rows in the RDD created in step 1.
3. Apply the schema to the RDD of Rows via the createDataFrame method provided by SQLContext.
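As a minimal sketch of those three steps (assuming a SparkContext sc and a SparkSession spark are in scope; the column names k and v and the sample rows are placeholders matching the schema used further below):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

// Step 1: an RDD of Rows, built here from a small in-memory sequence
val rowRDD = sc.parallelize(Seq(Row("a", 1), Row("b", 2)))

// Step 2: a schema whose fields match the structure of those Rows
val schema = StructType(
  StructField("k", StringType, true) ::
  StructField("v", IntegerType, false) :: Nil)

// Step 3: apply the schema to the RDD of Rows
val df = spark.createDataFrame(rowRDD, schema)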
To create an empty PySpark DataFrame manually with a schema (column names and data types), first create the schema using StructType and StructField, then pass an empty RDD to the createDataFrame() method of SparkSession along with that schema.
Let's assume you want a data frame with the following schema:
root
 |-- k: string (nullable = true)
 |-- v: integer (nullable = false)
You simply define the schema for a data frame and use an empty RDD[Row]:
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}
import org.apache.spark.sql.Row

val schema = StructType(
  StructField("k", StringType, true) ::
  StructField("v", IntegerType, false) :: Nil)

// Spark < 2.0
// sqlContext.createDataFrame(sc.emptyRDD[Row], schema)

spark.createDataFrame(sc.emptyRDD[Row], schema)
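As a quick sanity check (a hypothetical follow-up using the schema defined above), printing the schema of the result should reproduce the tree shown earlier:

val emptyDF = spark.createDataFrame(sc.emptyRDD[Row], schema)
emptyDF.printSchema()
// root
//  |-- k: string (nullable = true)
//  |-- v: integer (nullable = false)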
The PySpark equivalent is almost identical:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([
    StructField("k", StringType(), True),
    StructField("v", IntegerType(), False)
])

# or df = sc.parallelize([]).toDF(schema)

# Spark < 2.0
# sqlContext.createDataFrame([], schema)

df = spark.createDataFrame([], schema)
Using implicit encoders (Scala only) with Product types like Tuple:
import spark.implicits._

Seq.empty[(String, Int)].toDF("k", "v")
or a case class:
case class KV(k: String, v: Int)

Seq.empty[KV].toDF
or:
spark.emptyDataset[KV].toDF
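With the case-class route the schema is derived from the field types, so (as a hypothetical check) the primitive Int field comes out non-nullable, matching the target schema above:

Seq.empty[KV].toDF.printSchema()
// root
//  |-- k: string (nullable = true)
//  |-- v: integer (nullable = false)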
As of Spark 2.0.0, you can do the following.
Let's define a Person case class:
scala> case class Person(id: Int, name: String)
defined class Person
Import the implicit Encoders from the spark SparkSession:
scala> import spark.implicits._
import spark.implicits._
And use SparkSession to create an empty Dataset[Person]:
scala> spark.emptyDataset[Person]
res0: org.apache.spark.sql.Dataset[Person] = [id: int, name: string]
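A printSchema on that result (a hypothetical follow-up in the same REPL session) shows that the encoder also infers nullability from the field types, marking the primitive Int as non-nullable:

scala> spark.emptyDataset[Person].printSchema
root
 |-- id: integer (nullable = false)
 |-- name: string (nullable = true)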
You could also use a Schema "DSL" (see Support functions for DataFrames in org.apache.spark.sql.ColumnName).
scala> val id = $"id".int
id: org.apache.spark.sql.types.StructField = StructField(id,IntegerType,true)

scala> val name = $"name".string
name: org.apache.spark.sql.types.StructField = StructField(name,StringType,true)

scala> import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types.StructType

scala> val mySchema = StructType(id :: name :: Nil)
mySchema: org.apache.spark.sql.types.StructType = StructType(StructField(id,IntegerType,true), StructField(name,StringType,true))

scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row

scala> val emptyDF = spark.createDataFrame(sc.emptyRDD[Row], mySchema)
emptyDF: org.apache.spark.sql.DataFrame = [id: int, name: string]

scala> emptyDF.printSchema
root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)