 

Spark SQL performance

My code's algorithm is as follows.
Step 1. Load the data of one HBase entity into hBaseRDD

      JavaPairRDD<ImmutableBytesWritable, Result> hBaseRDD =
              jsc.newAPIHadoopRDD(hbase_conf, TableInputFormat.class,
                      ImmutableBytesWritable.class, Result.class);

Step 2. Transform hBaseRDD into rowPairRDD

     // in rowPairRDD the key is the HBase row key and the Row is the HBase row data
     JavaPairRDD<String, Row> rowPairRDD = hBaseRDD
             .mapToPair(***);
     // repartition() returns a new RDD, so the result has to be kept
     rowPairRDD = rowPairRDD.repartition(500);
     rowPairRDD.cache();

Step 3. Transform rowPairRDD into a schemaRDD and register it as a table

            JavaSQLContext sqlContext = new org.apache.spark.sql.api.java.JavaSQLContext(jsc);
            JavaSchemaRDD schemaRDD = sqlContext.applySchema(rowPairRDD.values(), schema);
            schemaRDD.registerTempTable("testentity");
            sqlContext.sqlContext().cacheTable("testentity");

Step 4. Use Spark SQL to run the first simple SQL query.

    JavaSchemaRDD retRDD = sqlContext.sql(
            "SELECT column1, column2 FROM testentity WHERE column3 = 'value1'");
    List<org.apache.spark.sql.api.java.Row> rows = retRDD.collect();

Step 5. Use Spark SQL to run the second simple SQL query.

JavaSchemaRDD retRDD=sqlContext.sql("SELECT column1, column2 FROM testentity 
                                     WHERE column3 = 'value2' ") 
List<org.apache.spark.sql.api.java.Row> rows = retRDD.collect(); 

Step 6. Use Spark SQL to run the third simple SQL query.

JavaSchemaRDD retRDD=sqlContext.sql("SELECT column1, column2 FROM testentity WHERE column3 = 'value3' "); 
List<org.apache.spark.sql.api.java.Row> rows = retRDD.collect(); 

The test results are as follows:

Test case 1:

When I insert 300,000 records into the HBase entity and then run the code:

  • the 1st query needs 60407 ms
  • the 2nd query needs 838 ms
  • the 3rd query needs 792 ms

If I use the HBase API to do a similar query, it only takes 2000 ms, so the last two Spark SQL queries are clearly much quicker than the HBase API query.
I believe the 1st Spark SQL query spends most of its time loading the data from HBase,
which is why it is much slower than the last two queries. I think this result is expected.

Test case 2:

When I insert 400,000 records into the HBase entity and then run the code:

  • the 1st query needs 87213 ms
  • the 2nd query needs 83238 ms
  • the 3rd query needs 82092 ms

If I use the HBase API to do a similar query, it only takes 3500 ms, so all three Spark SQL queries are clearly much slower than the HBase API query.
The last two Spark SQL queries are also very slow, with performance similar to the first query. Why? How can I tune the performance?

asked Dec 25 '14 by simafengyun


1 Answer

I suspect you are trying to cache more data than you have allocated to your Spark instance. I'll try to break down what is going on in each execution of the exact same query.

First of all, everything in Spark is lazy. This means that when you call rdd.cache(), nothing actually happens until you do something with the RDD.
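
To make that concrete, here is a minimal sketch of the laziness (my own illustration, not code from the question; it assumes jsc is the JavaSparkContext from Step 1 and uses a hypothetical text file as input):

    import org.apache.spark.api.java.JavaRDD;

    JavaRDD<String> lines = jsc.textFile("hdfs:///some/input"); // hypothetical path; nothing is read yet
    lines.cache();               // only marks the RDD for caching, still nothing happens
    long first = lines.count();  // first action: the data is read AND stored in the cache here
    long second = lines.count(); // second action: served from the in-memory cache, much faster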

First Query

  1. Full HBase scan (slow)
  2. Increase number of partitions (causes shuffle, slow)
  3. Data is actually cached into memory at this point, since caching is also lazy (kind of slow)
  4. Apply where predicate (fast)
  5. Results are collected

Second/Third Query

  1. Full in-memory scan (fast)
  2. Apply where predicate (fast)
  3. Results are collected

Now, Spark will try to cache as much of an RDD as possible. If it can't cache the whole thing, you may run into some serious slowdowns, especially if one of the steps before caching causes a shuffle. In that case you may be repeating steps 1-3 of the first query for every subsequent query. That's not ideal.
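
One mitigation (my suggestion, not part of the original answer): instead of the plain cache() from Step 2, persist the RDD with a storage level that spills to disk, so partitions that do not fit in memory are read back from local disk rather than recomputed from the HBase scan:

    import org.apache.spark.storage.StorageLevel;

    // Use this in place of rowPairRDD.cache(): partitions that do not fit in
    // memory are written to local disk and read back from there, instead of
    // re-running the HBase scan and the shuffle on every query.
    rowPairRDD.persist(StorageLevel.MEMORY_AND_DISK());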

To see whether the RDD is fully cached, go to your Spark Web UI (http://localhost:4040 if in local standalone mode) and look at the RDD storage/persistence information. Make sure it shows 100% cached.
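
If you prefer to check this programmatically instead of in the UI, something along these lines should work (a sketch only; it assumes jsc is the JavaSparkContext from the question, and getRDDStorageInfo is a developer API):

    import org.apache.spark.storage.RDDInfo;

    // Print how many partitions of each persisted RDD are actually held in the cache.
    for (RDDInfo info : jsc.sc().getRDDStorageInfo()) {
        System.out.println(info.name() + ": "
                + info.numCachedPartitions() + "/" + info.numPartitions()
                + " partitions cached, " + info.memSize() + " bytes in memory");
    }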

Edit (per comments):

My 400,000 records take up only about 250 MB in HBase. Why do I need 2 GB to fix the issue (1 GB >> 250 MB)?

I can't say for certain why you hit your max limit with spark.executor.memory=1G, but I will add some more relevant information about caching.

  • Spark only allocates a percentage of the executor's heap memory to caching. By default, this is spark.storage.memoryFraction=0.6 or 60%. So you are really only getting 1GB * 0.6.
  • The total space used in HBase likely differs from the heap space taken when caching in Spark. By default, Spark does not serialize Java objects when storing them in memory, so there is a decent amount of overhead from the Java object metadata. You can change the default persistence level (see the sketch below this list).
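
A sketch of both knobs in the Java API (the property names are the Spark 1.x defaults discussed above; the app name and the 2g value are just example values, and rowPairRDD is the RDD from Step 2 of the question, persisted here instead of being cache()d):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.storage.StorageLevel;

    // Give the executor more heap and, if needed, adjust the caching fraction.
    SparkConf conf = new SparkConf()
            .setAppName("hbase-spark-sql")                // example name
            .set("spark.executor.memory", "2g")           // total executor heap (example value)
            .set("spark.storage.memoryFraction", "0.6");  // share of the heap used for caching (default 0.6)
    JavaSparkContext jsc = new JavaSparkContext(conf);

    // Store the cached partitions in serialized form to cut the Java object
    // overhead, trading some CPU for a much smaller memory footprint.
    rowPairRDD.persist(StorageLevel.MEMORY_ONLY_SER());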

Do you know how to cache all the data to avoid the bad performance for the first query?

Invoking any action will cause the RDD to be cached. Just do this

scala> rdd.cache
scala> rdd.count

Now it's cached.
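
In the Java API used in the question, an equivalent warm-up could look like this (a sketch; the COUNT(*) query is my own addition, and it assumes the schemaRDD/testentity setup from Step 3):

    // A cheap query against the cached table forces the HBase scan, the
    // repartition and the cache fill to happen up front, before the first
    // "real" query from Step 4 runs.
    sqlContext.sqlContext().cacheTable("testentity");
    sqlContext.sql("SELECT COUNT(*) FROM testentity").collect();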

answered Sep 23 '22 by Mike Park