I have a DataFrame registered as a temporary view with
table.createOrReplaceTempView("table")
and the query is
spark.sql("SELECT column1 FROM table WHERE column2 = 'value'")
I want to rewrite the query with the DataFrame API as
table.filter(table("column2") === "value").select(col("column1"))
So which of the two is faster on large datasets: plain spark.sql, or filter and select?
I assume that if their physical execution plans are exactly the same, performance will be the same as well. So let's test that, on Spark 2.2.0:
scala> import spark.implicits._
import spark.implicits._
scala> case class Record(column1: String, column2: String)
defined class Record
scala> val table = List(Record("foo", "value"), Record("bar", "notvalue")).toDF
table: org.apache.spark.sql.DataFrame = [column1: string, column2: string]
scala> table.createOrReplaceTempView("table")
scala> val a = spark.sql("SELECT column1 from TABLE where column2 = 'value'")
a: org.apache.spark.sql.DataFrame = [column1: string]
scala> val b = table.filter(table("column2") === "value").select(col("column1"))
b: org.apache.spark.sql.DataFrame = [column1: string]
scala> a.explain()
== Physical Plan ==
*Project [column1#41]
+- *Filter (isnotnull(column2#42) && (column2#42 = value))
+- LocalTableScan [column1#41, column2#42]
scala> b.explain()
== Physical Plan ==
*Project [column1#41]
+- *Filter (isnotnull(column2#42) && (column2#42 = value))
+- LocalTableScan [column1#41, column2#42]
Looks like there's no difference at all: both queries compile down to exactly the same physical plan, so Catalyst treats them identically and execution performance will be the same.
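If you'd rather not compare the explain() output by eye, you can also check plan equivalence programmatically. A minimal sketch, continuing the same shell session as above: QueryPlan.sameResult compares two plans while ignoring cosmetic differences such as expression IDs.
scala> // should evaluate to true here, since the two physical plans are identical
scala> a.queryExecution.executedPlan.sameResult(b.queryExecution.executedPlan)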
It depends on your use case: just try both and use whichever runs faster for you. I would recommend timing each variant with spark.time:
1. spark.time(df.filter("...").select("..."))
2. spark.time(spark.sql("..."))
spark.time prints the elapsed time, so you can pick whichever version executes faster in your code.
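One caveat: DataFrame transformations are lazy, so timing only the filter/select mostly measures plan construction, not execution. To compare actual run times, force an action such as count() inside the timed block. A minimal sketch, reusing the table DataFrame and view from the answer above (spark.time prints "Time taken: ... ms" and returns the expression's result):
scala> // count() forces the query to actually execute; without an action,
scala> // spark.time would measure almost nothing
scala> spark.time(spark.sql("SELECT column1 FROM table WHERE column2 = 'value'").count())
scala> spark.time(table.filter(table("column2") === "value").select(col("column1")).count())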