I have a DataFrame registered as a temporary view with
table.createOrReplaceTempView("table")
and the query is
spark.sql("SELECT column1 FROM table WHERE column2 = 'value'")
I want to rewrite the query with the DataFrame API as
table.filter(table("column2") === "value").select(col("column1"))
So which of the two is faster on large datasets: plain spark.sql, or filter and select?
I assume that if their physical execution plans are exactly the same, performance will be the same as well. So let's test that, on Spark 2.2.0:
scala> import spark.implicits._
import spark.implicits._
scala> case class Record(column1: String, column2: String)
defined class Record
scala> val table = List(Record("foo", "value"), Record("bar", "notvalue")).toDF
table: org.apache.spark.sql.DataFrame = [column1: string, column2: string]
scala> table.createOrReplaceTempView("table")
scala> val a = spark.sql("SELECT column1 from TABLE where column2 = 'value'")
a: org.apache.spark.sql.DataFrame = [column1: string]
scala> val b = table.filter(table("column2") === "value").select(col("column1"))
b: org.apache.spark.sql.DataFrame = [column1: string]
scala> a.explain()
== Physical Plan ==
*Project [column1#41]
+- *Filter (isnotnull(column2#42) && (column2#42 = value))
+- LocalTableScan [column1#41, column2#42]
scala> b.explain()
== Physical Plan ==
*Project [column1#41]
+- *Filter (isnotnull(column2#42) && (column2#42 = value))
+- LocalTableScan [column1#41, column2#42]
Looks like there's no difference at all: both queries compile down to exactly the same physical plan, so Catalyst treats them identically and execution performance will be the same.
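If you'd rather not compare the explain() output by eye, you can also check plan equivalence programmatically. A minimal sketch, continuing the same shell session as above: QueryPlan.sameResult compares two plans while ignoring cosmetic differences such as expression IDs.
scala> // should evaluate to true here, since the two physical plans are identical
scala> a.queryExecution.executedPlan.sameResult(b.queryExecution.executedPlan)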
It depends on your use case: just try both and use whichever runs faster for you. I would recommend timing each variant with spark.time:
1. spark.time(df.filter("...").select("..."))
2. spark.time(spark.sql("..."))
spark.time prints the elapsed time, so you can pick whichever version executes faster in your code.
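One caveat: DataFrame transformations are lazy, so timing only the filter/select mostly measures plan construction, not execution. To compare actual run times, force an action such as count() inside the timed block. A minimal sketch, reusing the table DataFrame and view from the answer above (spark.time prints "Time taken: ... ms" and returns the expression's result):
scala> // count() forces the query to actually execute; without an action,
scala> // spark.time would measure almost nothing
scala> spark.time(spark.sql("SELECT column1 FROM table WHERE column2 = 'value'").count())
scala> spark.time(table.filter(table("column2") === "value").select(col("column1")).count())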