
Spark query running very slow

I have a cluster on AWS with 1 master and 2 slaves; all instances are of type m1.large. I'm running Spark 1.4 and benchmarking its performance over ~4M rows coming from Redshift. I ran one query through the pyspark shell:

    df = sqlContext.load(source="jdbc", url="connection_string",
                         dbtable="table_name", user="user", password="pass")
    df.registerTempTable('test')
    d = sqlContext.sql("""
    select user_id from (
        select -- (i1)
            sum(total),
            user_id
        from
            (select -- (i2)
                avg(total) as total,
                user_id
            from
                test
            group by
                order_id,
                user_id) as a
        group by
            user_id
        having sum(total) > 0
    ) as b
    """)
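To make the query's intent concrete, here is a hedged pure-Python sketch of the same aggregation over hypothetical sample rows (no Spark needed): the inner select (i2) averages `total` per (`order_id`, `user_id`), the outer select (i1) sums those averages per `user_id`, and the `having` clause keeps only positive sums.

```python
from collections import defaultdict

# Hypothetical rows of the "test" table: (order_id, user_id, total)
rows = [
    (1, "u1", 10.0),
    (1, "u1", 30.0),   # same (order_id, user_id) -> averaged with the row above
    (2, "u1", 5.0),
    (3, "u2", -4.0),
]

# Inner query (i2): avg(total) grouped by (order_id, user_id)
groups = defaultdict(list)
for order_id, user_id, total in rows:
    groups[(order_id, user_id)].append(total)
avg_per_order = {k: sum(v) / len(v) for k, v in groups.items()}

# Outer query (i1): sum of those averages per user_id
per_user = defaultdict(float)
for (order_id, user_id), avg_total in avg_per_order.items():
    per_user[user_id] += avg_total

# having sum(total) > 0
user_ids = sorted(u for u, s in per_user.items() if s > 0)
print(user_ids)  # -> ['u1'] (u1: avg 20 + avg 5 = 25 > 0; u2: -4 excluded)
```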

When I run d.count(), the query above takes 30 s when df is not cached and 17 s when df is cached in memory.

I'm expecting these timings to be closer to 1-2 s.

These are my Spark configurations:

    spark.executor.memory 6154m
    spark.driver.memory 3g
    spark.shuffle.spill false
    spark.default.parallelism 8

The rest is set to its default values. Can anyone see what I'm missing here?

asked Jul 29 '15 by Arpit


2 Answers

  1. Set spark.default.parallelism to 2
  2. Start Spark with --executor-cores 8
  3. Cache the table before querying it, changing

    df.registerTempTable('test')
    d = sqlContext.sql("""...

to

    df.registerTempTable('test')
    sqlContext.cacheTable('test')
    d = sqlContext.sql("""...
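Suggestions 1 and 2 are launch-time settings rather than code changes. A sketch of how they would typically be applied (values are the ones suggested in this answer; the standard spark-submit/pyspark flag for cores per executor is --executor-cores):

    # conf/spark-defaults.conf
    spark.default.parallelism  2

    # or equivalently at launch:
    pyspark --executor-cores 8 --conf spark.default.parallelism=2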

answered Oct 01 '22 by Boggio


This is normal; don't expect Spark to answer in a few milliseconds the way MySQL or Postgres do. Spark is low-latency compared to other big-data solutions like Hive, Impala... but you cannot compare it with a classic database: Spark is not a database where the data are indexed!

Watch this video: https://www.youtube.com/watch?v=8E0cVWKiuhk

They clearly place Spark in the "not so low latency" band (see the chart in the talk).

Did you try Apache Drill? I found it a bit faster (I use it on small HDFS JSON files, 2-3 GB; much faster than Spark for SQL queries).

answered Oct 01 '22 by Thomas Decaux