Spark DataFrame is Untyped vs DataFrame has schema?

I am a beginner to Spark. While reading about DataFrames, I have often come across the following two statements:

1) A DataFrame is untyped. 2) A DataFrame has a schema (like a database table, which has all the information about its attributes: name, type, nullability).

Aren't these two statements contradictory? First we say the DataFrame is untyped, and at the same time we say the DataFrame has information about all of its columns, i.e. a schema. What am I missing here? If the DataFrame has a schema, then it also knows the types of its columns, so how is it untyped?

Rex asked Sep 12 '18 06:09


People also ask

How do I see the structure schema of the DataFrame in Spark SQL?

To get the schema of a Spark DataFrame, call printSchema() on the DataFrame object. printSchema() prints the schema to the console (stdout), and show() displays the contents of the DataFrame.
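
For instance, a minimal sketch (it assumes a SparkSession named spark is in scope, as in spark-shell; the column names and sample rows are made up for illustration):

// assumes a SparkSession named `spark`, e.g. in spark-shell
import spark.implicits._

val people = Seq(("alice", 30), ("bob", 25)).toDF("name", "age")

people.printSchema()
// root
// |-- name: string (nullable = true)
// |-- age: integer (nullable = false)

people.show()
// +-----+---+
// | name|age|
// +-----+---+
// |alice| 30|
// |  bob| 25|
// +-----+---+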

What is typed and untyped in Spark?

DataFrames are dynamically typed, while Datasets and RDDs are statically typed. That means when you define a Dataset or RDD you need to explicitly specify a class that represents the content.

How do I enforce a schema in Spark DataFrame?

We can create a DataFrame programmatically using the following three steps: create an RDD of Rows from the original RDD; create the schema, represented by a StructType, matching the structure of the Rows in the RDD created in step 1; and apply the schema to the RDD of Rows via the createDataFrame method provided by SQLContext (or by SparkSession in newer versions).
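
A minimal sketch of those three steps (assuming a SparkSession named spark; the field names and rows are just for illustration):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

// Step 1: an RDD of Rows (here built from a small in-memory collection)
val rowRdd = spark.sparkContext.parallelize(Seq(Row("alice", 30), Row("bob", 25)))

// Step 2: the schema, represented by a StructType matching the Rows above
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = false)))

// Step 3: apply the schema via createDataFrame
val df = spark.createDataFrame(rowRdd, schema)
df.printSchema()
// root
// |-- name: string (nullable = true)
// |-- age: integer (nullable = false)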


1 Answer

DataFrames are dynamically typed, while Datasets and RDDs are statically typed. That means when you define a Dataset or RDD you need to explicitly specify a class that represents the content. This can be useful, because when you go to write transformations on your Dataset, the compiler can check your code for type safety. Take for example this dataset of Pet info. When I use pet.species or pet.name the compiler knows their types at compile time.

// assumes a SparkSession named `spark` is in scope, as in spark-shell
import org.apache.spark.sql.{DataFrame, Dataset}
import spark.implicits._

case class Pet(name: String, species: String, age: Int, weight: Double)

val data: Dataset[Pet] = Seq(
  Pet("spot", "dog", 2, 50.5),
  Pet("mittens", "cat", 11, 15.5),
  Pet("mickey", "mouse", 1, 1.5)).toDS
println(data.map(x => x.getClass.getSimpleName).first)
// Pet

val newDataset: Dataset[String] = data.map(pet => s"I have a ${pet.species} named ${pet.name}.")

When we switch to using a DataFrame, the schema stays the same and the data is still typed (or structured), but this information is only available at runtime. This is called dynamic typing. It prevents the compiler from catching your mistakes, but it can be very useful because it allows you to write SQL-like statements and define new columns on the fly, for example appending columns to an existing DataFrame, without needing to define a new class for every little operation. The flip side is that you can define bad operations that result in nulls or, in some cases, runtime errors.

val df: DataFrame = data.toDF
df.printSchema()
// root
// |-- name: string (nullable = true)
// |-- species: string (nullable = true)
// |-- age: integer (nullable = false)
// |-- weight: double (nullable = false)

val newDf: DataFrame = df
  .withColumn("some column", ($"age" + $"weight"))
  .withColumn("bad column", ($"name" + $"age"))
newDf.show()
// +-------+-------+---+------+-----------+----------+
// |   name|species|age|weight|some column|bad column|
// +-------+-------+---+------+-----------+----------+
// |   spot|    dog|  2|  50.5|       52.5|      null|
// |mittens|    cat| 11|  15.5|       26.5|      null|
// | mickey|  mouse|  1|   1.5|        2.5|      null|
// +-------+-------+---+------+-----------+----------+
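
To make the contrast concrete, here is a small sketch (the mistakes are hypothetical, added only for illustration): with the typed Dataset the compiler rejects a misspelled field or a nonsensical operation at compile time, while the equivalent DataFrame code compiles and only fails, or silently produces nulls, at runtime.

// With the typed Dataset, mistakes are caught at compile time:
// data.map(pet => pet.naem)           // does not compile: value naem is not a member of Pet
// data.map(pet => pet.age * pet.name) // does not compile: Int has no * taking a String

// With the untyped DataFrame, the same mistakes only surface at runtime:
// df.select($"naem")                  // compiles, throws AnalysisException at runtime
// and the "bad column" above compiles but just yields nulls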
Bi Rico answered Oct 01 '22 19:10