 

How does computing table stats in hive or impala speed up queries in Spark SQL?

For increasing performance (e.g. for joins) it is recommended to compute table statistics first.

In Hive I can do:

analyze table <table name> compute statistics;

In Impala:

compute stats <table name>;

Does my Spark application (reading from Hive tables) also benefit from pre-computed statistics? If yes, which one do I need to run? Are they both saving the stats in the Hive metastore? I'm using Spark 1.6.1 on Cloudera 5.5.4.

Note: In the docs of Spark 1.6.1 (https://spark.apache.org/docs/1.6.1/sql-programming-guide.html), under the parameter spark.sql.autoBroadcastJoinThreshold, I found a hint:

Note that currently statistics are only supported for Hive Metastore tables where the command ANALYZE TABLE COMPUTE STATISTICS noscan has been run.
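
For reference, a minimal sketch of what that looks like in practice (my_table is a placeholder; 10485760 bytes is simply the documented 10 MB default for the threshold):

-- run in Hive, or via sqlContext.sql(...) from a HiveContext, so the size lands in the metastore
ANALYZE TABLE my_table COMPUTE STATISTICS NOSCAN;

-- broadcast joins are considered for tables smaller than this many bytes
SET spark.sql.autoBroadcastJoinThreshold=10485760;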

Raphael Roth asked Sep 22 '16


People also ask

What does compute stats do in Impala?

The COMPUTE INCREMENTAL STATS syntax lets you collect statistics for newly added or changed partitions, without rescanning the entire table. File format considerations: The COMPUTE STATS statement works with tables created with any of the file formats supported by Impala.
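
A hedged sketch of that in impala-shell (the table sales and the partition key day are made up for illustration):

-- refresh statistics only for a newly loaded partition
COMPUTE INCREMENTAL STATS sales PARTITION (day='2016-09-22');

-- or let Impala pick up whichever partitions are still missing stats
COMPUTE INCREMENTAL STATS sales;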

What is the use of analyze table compute statistics in Hive?

The ANALYZE TABLE command generates statistics for tables and columns. The following lines show how to generate different types of statistics on Hive objects.
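
For example, in Hive (my_table is a placeholder; the FOR COLUMNS form is what produces column-level statistics such as min/max and distinct counts):

-- table-level statistics (row count, raw data size, number of files)
ANALYZE TABLE my_table COMPUTE STATISTICS;

-- column-level statistics for all columns
ANALYZE TABLE my_table COMPUTE STATISTICS FOR COLUMNS;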

How do we verify that table has stats computed before using in Impala?

You can check whether a specific table has statistics using the SHOW TABLE STATS statement (for any table) or the SHOW PARTITIONS statement (for a partitioned table). Both statements display the same information. If a table or a partition does not have any statistics, the #Rows field contains -1.
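
A quick sketch in impala-shell (my_table is a placeholder):

-- table-level stats; #Rows stays at -1 until COMPUTE STATS has been run
SHOW TABLE STATS my_table;

-- per-partition stats for a partitioned table
SHOW PARTITIONS my_table;

-- per-column stats (NDV, #Nulls, max/avg size)
SHOW COLUMN STATS my_table;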

Can Impala query Hive tables?

Thus, Impala can access tables defined or loaded by Hive, as long as all columns use Impala-supported data types, file formats, and compression codecs. The initial focus on query features and performance means that Impala can read more types of data with the SELECT statement than it can write with the INSERT statement.


2 Answers

This answer is based on the upcoming Spark 2.3.0 (perhaps some of the features have already been released in 2.2.1 or earlier).

Does my spark application (reading from hive-tables) also benefit from pre-computed statistics?

It could, if Impala or Hive recorded the table statistics (e.g. table size or row count) in the table metadata in the Hive metastore, where Spark can read them (and translate them to its own Spark statistics for query planning).

You can easily check it by using the DESCRIBE EXTENDED SQL command in spark-shell.

scala> spark.version
res0: String = 2.4.0-SNAPSHOT

scala> sql("DESC EXTENDED t1 id").show
+--------------+----------+
|info_name     |info_value|
+--------------+----------+
|col_name      |id        |
|data_type     |int       |
|comment       |NULL      |
|min           |0         |
|max           |1         |
|num_nulls     |0         |
|distinct_count|2         |
|avg_col_len   |4         |
|max_col_len   |4         |
|histogram     |NULL      |
+--------------+----------+

ANALYZE TABLE COMPUTE STATISTICS noscan computes one statistic that Spark uses, i.e. the total size of a table (with no row count metric due to the noscan option). If Impala or Hive recorded it to a "proper" location, Spark SQL would show it in DESC EXTENDED.
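
As a rough sketch of the difference (reusing t1 from above), the noscan variant records only the size in bytes, while the full variant also scans the data and records the row count:

-- records only the total size (sizeInBytes)
ANALYZE TABLE t1 COMPUTE STATISTICS NOSCAN;

-- scans the table and records the row count as well
ANALYZE TABLE t1 COMPUTE STATISTICS;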

Use DESC EXTENDED tableName for table-level statistics and see if you find the ones that were generated by Impala or Hive. If they are in DESC EXTENDED's output, they will be used for optimizing joins (and, with cost-based optimization turned on, also for aggregations and filters).
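
For instance (t1 again; spark.sql.cbo.enabled is the cost-based optimizer switch, which is off by default):

-- look for a "Statistics" row (e.g. "1234 bytes, 42 rows") in the output
DESCRIBE EXTENDED t1;

-- let the optimizer use those statistics beyond plain join planning
SET spark.sql.cbo.enabled=true;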


Column statistics are stored (in a Spark-specific serialized format) in table properties and I really doubt that Impala or Hive could compute the stats and store them in the Spark SQL-compatible format.
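
So if you want column-level statistics that Spark can actually use, a minimal sketch (Spark 2.x syntax, t1 and id as in the example above) is to compute them from Spark itself:

-- Spark stores min/max, null count and distinct count in its own table properties
ANALYZE TABLE t1 COMPUTE STATISTICS FOR COLUMNS id;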

Jacek Laskowski answered Oct 05 '22


I am assuming you are using Hive on Spark, or Spark SQL with a Hive context. If that is the case, you should run ANALYZE in Hive.

ANALYZE TABLE <...> typically needs to be run after the table is created or after significant inserts/changes. You can do this at the end of your load step itself if it is an MR or Spark job.
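
As a rough sketch (the partitioned table sales and its partition value are made up), the last step of a daily load could be:

-- refresh statistics only for the partition that was just loaded
ANALYZE TABLE sales PARTITION (dt='2016-09-22') COMPUTE STATISTICS;
ANALYZE TABLE sales PARTITION (dt='2016-09-22') COMPUTE STATISTICS FOR COLUMNS;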

At the time of analysis, if you are using Hive on Spark, please also use the configurations in the link below. You can set these at the session level for each query. I have used the parameters from this link https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started in production and they work fine.
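
The session-level settings from that page look roughly like this (the executor sizes below are only illustrative; tune them to your cluster):

set hive.execution.engine=spark;
set spark.executor.memory=4g;
set spark.executor.cores=2;
set spark.executor.instances=10;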

ganeiy answered Oct 05 '22