I am a beginner to Spark. While reading about DataFrames, I have very often found the following two statements:
1) DataFrame is untyped 2) DataFrame has a schema (like a database table, which has all the information related to the table's attributes - name, type, not null)
Aren't the two statements contradictory? First we say the DataFrame is untyped, and at the same time we say the DataFrame has information about all of its columns, i.e. a schema. Please help me understand what I am missing here, because if the DataFrame has a schema then it also knows the types of its columns, so how can it be untyped?
To get the schema of a Spark DataFrame, call printSchema() on the DataFrame object: printSchema() prints the schema to the console (stdout), while show() displays the contents of the DataFrame.
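A quick sketch of the two calls, assuming a DataFrame named df (such as the one built later in this answer):

// printSchema() describes the structure; show() displays the data.
df.printSchema()  // column names, types, and nullability, printed to stdout
df.show()         // the first 20 rows of the DataFrame, as a formatted table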
We can create a DataFrame programmatically using the following three steps, as sketched below:
1. Create an RDD of Rows from the original RDD.
2. Create the schema, represented by a StructType, matching the structure of the Rows in the RDD created in step 1.
3. Apply the schema to the RDD of Rows via the createDataFrame method provided by SQLContext (or SparkSession in Spark 2.x and later).
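A minimal sketch of those three steps, assuming a SparkSession named spark (spark.createDataFrame replaces the older SQLContext method):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Step 1: an RDD of Rows
val rowRDD = spark.sparkContext.parallelize(Seq(Row("spot", 2), Row("mittens", 11)))

// Step 2: a StructType matching the structure of those Rows
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)))

// Step 3: apply the schema to the RDD of Rows
val petsDf = spark.createDataFrame(rowRDD, schema)
petsDf.printSchema()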
DataFrames are dynamically typed, while Datasets and RDDs are statically typed. That means that when you define a Dataset or RDD, you need to explicitly specify a class that represents the content. This can be useful, because when you write transformations on your Dataset, the compiler can check your code for type safety. Take for example this Dataset of Pet info: when I use pet.species or pet.name, the compiler knows their types at compile time.
import org.apache.spark.sql.{DataFrame, Dataset}
import spark.implicits._  // encoders required by toDS/toDF (spark is the SparkSession)

case class Pet(name: String, species: String, age: Int, weight: Double)

val data: Dataset[Pet] = Seq(
  Pet("spot", "dog", 2, 50.5),
  Pet("mittens", "cat", 11, 15.5),
  Pet("mickey", "mouse", 1, 1.5)).toDS

// Every element is a strongly typed Pet:
println(data.map(x => x.getClass.getSimpleName).first)
// Pet

// The compiler verifies that species and name exist and are Strings:
val newDataset: Dataset[String] = data.map(pet => s"I have a ${pet.species} named ${pet.name}.")
When we switch to using a DataFrame, the schema stays the same and the data is still typed (or structured), but this information is only available at runtime. This is called dynamic typing. It prevents the compiler from catching your mistakes, but it can be very useful because it lets you write SQL-like statements and define new columns on the fly, for example appending columns to an existing DataFrame, without needing to define a new class for every little operation. The flip side is that you can define bad operations that result in nulls or, in some cases, runtime errors.
// Converting the Dataset to a DataFrame keeps the schema but drops the compile-time type:
val df: DataFrame = data.toDF
df.printSchema()
// root
// |-- name: string (nullable = true)
// |-- species: string (nullable = true)
// |-- age: integer (nullable = false)
// |-- weight: double (nullable = false)
val newDf: DataFrame = df
  .withColumn("some column", $"age" + $"weight") // Int + Double: a sensible sum
  .withColumn("bad column", $"name" + $"age")    // String + Int: compiles, but produces null at runtime
newDf.show()
// +-------+-------+---+------+-----------+----------+
// | name|species|age|weight|some column|bad column|
// +-------+-------+---+------+-----------+----------+
// | spot| dog| 2| 50.5| 52.5| null|
// |mittens| cat| 11| 15.5| 26.5| null|
// | mickey| mouse| 1| 1.5| 2.5| null|
// +-------+-------+---+------+-----------+----------+
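To make the flip side concrete, here is a small sketch reusing the df and data defined above (the misspelled column speices is deliberate and hypothetical):

import scala.util.Try

// With the DataFrame, a misspelled column name still compiles; Spark only
// rejects it at runtime, during analysis, with an AnalysisException.
val attempt = Try(df.select($"speices").show())
println(attempt.isFailure)
// true

// The equivalent typo on the Dataset would not compile at all:
// data.map(pet => pet.speices)  // error: value speices is not a member of Pet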