Spark DataFrame is Untyped vs DataFrame has schema?

I am a beginner to Spark. While reading about DataFrames, I have often come across the following two statements:

1) A DataFrame is untyped. 2) A DataFrame has a schema (like a database table, which has all the information about its attributes: name, type, nullability).

Aren't these two statements contradictory? First we say the DataFrame is untyped, and at the same time we say the DataFrame has information about all of its columns, i.e. a schema. What am I missing here? If the DataFrame has a schema, then it also knows the types of its columns, so how is it untyped?

Rex asked Sep 12 '18 06:09


People also ask

How do I see the structure schema of the DataFrame in Spark SQL?

To get the schema of a Spark DataFrame, call printSchema() on the DataFrame object. printSchema() prints the schema to the console (stdout), and show() displays the contents of the DataFrame.
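
For instance, a minimal sketch (it assumes a SparkSession named spark is in scope, as in spark-shell; the column names and sample rows are made up for illustration):

// assumes a SparkSession named `spark`, e.g. in spark-shell
import spark.implicits._

val people = Seq(("alice", 30), ("bob", 25)).toDF("name", "age")

people.printSchema()
// root
// |-- name: string (nullable = true)
// |-- age: integer (nullable = false)

people.show()
// +-----+---+
// | name|age|
// +-----+---+
// |alice| 30|
// |  bob| 25|
// +-----+---+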

What is typed and untyped in Spark?

DataFrames are dynamically typed, while Datasets and RDDs are statically typed. That means when you define a Dataset or RDD you need to explicitly specify a class that represents the content.

How do I enforce a schema in Spark DataFrame?

We can create a DataFrame programmatically using the following three steps: create an RDD of Rows from the original RDD; create the schema, represented by a StructType, matching the structure of the Rows in the RDD created in step 1; and apply the schema to the RDD of Rows via the createDataFrame method provided by SQLContext (or by SparkSession in newer versions).
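
A minimal sketch of those three steps (assuming a SparkSession named spark; the field names and rows are just for illustration):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

// Step 1: an RDD of Rows (here built from a small in-memory collection)
val rowRdd = spark.sparkContext.parallelize(Seq(Row("alice", 30), Row("bob", 25)))

// Step 2: the schema, represented by a StructType matching the Rows above
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = false)))

// Step 3: apply the schema via createDataFrame
val df = spark.createDataFrame(rowRdd, schema)
df.printSchema()
// root
// |-- name: string (nullable = true)
// |-- age: integer (nullable = false)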


1 Answer

DataFrames are dynamically typed, while Datasets and RDDs are statically typed. That means when you define a Dataset or RDD you need to explicitly specify a class that represents the content. This can be useful, because when you go to write transformations on your Dataset, the compiler can check your code for type safety. Take for example this dataset of Pet info. When I use pet.species or pet.name the compiler knows their types at compile time.

// assumes a SparkSession named `spark` is in scope, as in spark-shell
import org.apache.spark.sql.{DataFrame, Dataset}
import spark.implicits._

case class Pet(name: String, species: String, age: Int, weight: Double)

val data: Dataset[Pet] = Seq(
  Pet("spot", "dog", 2, 50.5),
  Pet("mittens", "cat", 11, 15.5),
  Pet("mickey", "mouse", 1, 1.5)).toDS
println(data.map(x => x.getClass.getSimpleName).first)
// Pet

val newDataset: Dataset[String] = data.map(pet => s"I have a ${pet.species} named ${pet.name}.")

When we switch to using a DataFrame, the schema stays the same and the data is still typed (or structured), but this information is only available at runtime. This is called dynamic typing. It prevents the compiler from catching your mistakes, but it can be very useful because it allows you to write SQL-like statements and define new columns on the fly, for example appending columns to an existing DataFrame, without needing to define a new class for every little operation. The flip side is that you can define bad operations that result in nulls or, in some cases, runtime errors.

val df: DataFrame = data.toDF
df.printSchema()
// root
// |-- name: string (nullable = true)
// |-- species: string (nullable = true)
// |-- age: integer (nullable = false)
// |-- weight: double (nullable = false)

val newDf: DataFrame = df
  .withColumn("some column", ($"age" + $"weight"))
  .withColumn("bad column", ($"name" + $"age"))
newDf.show()
// +-------+-------+---+------+-----------+----------+
// |   name|species|age|weight|some column|bad column|
// +-------+-------+---+------+-----------+----------+
// |   spot|    dog|  2|  50.5|       52.5|      null|
// |mittens|    cat| 11|  15.5|       26.5|      null|
// | mickey|  mouse|  1|   1.5|        2.5|      null|
// +-------+-------+---+------+-----------+----------+
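
To make the contrast concrete, here is a small sketch (the mistakes are hypothetical, added only for illustration): with the typed Dataset the compiler rejects a misspelled field or a nonsensical operation at compile time, while the equivalent DataFrame code compiles and only fails, or silently produces nulls, at runtime.

// With the typed Dataset, mistakes are caught at compile time:
// data.map(pet => pet.naem)           // does not compile: value naem is not a member of Pet
// data.map(pet => pet.age * pet.name) // does not compile: Int has no * taking a String

// With the untyped DataFrame, the same mistakes only surface at runtime:
// df.select($"naem")                  // compiles, throws AnalysisException at runtime
// and the "bad column" above compiles but just yields nulls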
Bi Rico answered Oct 01 '22 19:10