I have tried to compare the performance of Spark SQL version 1.6 and version 1.5. In a simple case, Spark 1.6 is quite a bit faster than Spark 1.5. However, on a more complex query, in my case an aggregation query with grouping sets, Spark SQL 1.6 is much slower than Spark SQL 1.5. Has anybody noticed the same issue? And, even better, does anyone have a solution for this kind of query?
Here is my code:
case class Toto(
  a: String = f"${(math.random*1e6).toLong}%06.0f",
  b: String = f"${(math.random*1e6).toLong}%06.0f",
  c: String = f"${(math.random*1e6).toLong}%06.0f",
  n: Int = (math.random*1e3).toInt,
  m: Double = math.random*1e3)

// Generate one million random rows and register them as a temporary table.
val data = sc.parallelize(1 to 1e6.toInt).map(i => Toto())
val df: org.apache.spark.sql.DataFrame = sqlContext.createDataFrame(data)
df.registerTempTable("toto")

// Aggregation with a single COUNT(DISTINCT ...) over grouping sets.
val sqlSelect = "SELECT a, b, COUNT(1) AS k1, COUNT(DISTINCT n) AS k2, SUM(m) AS k3"
val sqlGroupBy = "FROM toto GROUP BY a, b GROUPING SETS ((a,b),(a),(b))"
val sqlText = s"$sqlSelect $sqlGroupBy"
val rs1 = sqlContext.sql(sqlText)
rs1.saveAsParquetFile("rs1")
Here are two screenshots of the runs on Spark 1.5.2 and Spark 1.6.0, both with --driver-memory=1G. The DAG on Spark 1.6.0 can be viewed at DAG.
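For reference, here is a minimal sketch of how the same query can be timed end to end from the shell instead of reading durations off the UI; the time helper and the output directory "rs1_timed" are mine, not part of the original test:

// Rough end-to-end timing of the action; reuses sqlText from the snippet above.
def time[T](label: String)(block: => T): T = {
  val start = System.nanoTime()
  val result = block
  println(f"$label took ${(System.nanoTime() - start) / 1e9}%.1f s")
  result
}
time("grouping sets aggregation") {
  sqlContext.sql(sqlText).saveAsParquetFile("rs1_timed") // hypothetical output directory
}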
Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine. It enables unmodified Hadoop Hive queries to run up to 100x faster on existing deployments and data.
The high-level query language and additional type information make Spark SQL more efficient. Spark SQL also uses in-memory columnar storage, a feature that stores cached data in a columnar format rather than a row format.
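As a small illustrative sketch (assuming the same sqlContext and the "toto" temp table registered above), caching a table puts it into this in-memory columnar format:

sqlContext.cacheTable("toto")                       // cached in the in-memory columnar format
sqlContext.sql("SELECT COUNT(1) FROM toto").show()  // first action materializes the cache
sqlContext.uncacheTable("toto")                     // release the cached columnar data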
Thanks to Herman van Hövell for his reply on the Spark dev mailing list. To share it with other members, I reproduce his response here:
1.6 plans single distinct aggregates like multiple distinct aggregates; this inherently causes some overhead but is more stable in case of high cardinalities. You can revert to the old behavior by setting the spark.sql.specializeSingleDistinctAggPlanning option to false. See also: https://github.com/apache/spark/blob/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala#L452-L462
Actually, in order to revert to the old behavior, the setting's value should be "true".
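Concretely, a minimal sketch of applying this before re-running the query (the output directory "rs2" is just an illustration):

// Revert to the Spark 1.5-style plan for single distinct aggregates (per the note above, use "true").
sqlContext.setConf("spark.sql.specializeSingleDistinctAggPlanning", "true")
val rs2 = sqlContext.sql(sqlText)
rs2.saveAsParquetFile("rs2") // hypothetical output path

The same option can also be set from SQL with SET spark.sql.specializeSingleDistinctAggPlanning=true.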