The ANALYZE TABLE command run from Spark on a Hive table does not give the same performance improvement as the same command issued from Hive.
For example, I have inserted a dataframe into an empty Hive table:-
output.write.insertInto(“XXXXXXXX”)
and then run the analyze table command:-
spark.sql("ANALYZE TABLE XXXXXXXX COMPUTE STATISTICS")
When I do a record count in Hive it is very slow:-
select count(*) from XXXXXXXX;
+-----------+
|    _c0    |
+-----------+
| 12345678  |
+-----------+
1 row selected (36.038 seconds)
But if I run the same analyze table command directly in Hive, performance improves:-
select count(*) from XXXXXXXX;
+-----------+
|    _c0    |
+-----------+
| 12345678  |
+-----------+
1 row selected (0.096 seconds)
Can anyone explain why this difference occurs? Is there a workaround?
it's because spark is stupid. Spark's ANALYZE TABLE command only writes the computed statistics in a Spark-compatible format to the table properties that Hive is unaware of. Spark completely ignores the standard Hive table statistics when writing those.
if you do a
show create table XXXXXXXX
in Hive after the spark.sql("ANALYZE..") step in your query, you will see the following in the tbl_properties section:
TBLPROPERTIES ( 
  'numFiles'='1', 
  'numRows'='-1', 
  'rawDataSize'='-1', 
  'spark.sql.statistics.numRows'='12345678', 
  'spark.sql.statistics.totalSize'='12345678', 
  'totalSize'='12345678', 
only after doing the same command in hive you will get the correct information:
TBLPROPERTIES ( 
  'numFiles'='1', 
  'numRows'='12345678', 
  'rawDataSize'='12345678', 
  'spark.sql.statistics.numRows'='12345678', 
  'spark.sql.statistics.totalSize'='12345678', 
  'totalSize'='12345678', 
Could Spark fill the numRows and rawDataSize field correctly when computing statistics? Probably. Why does it not do it? No idea. :( Probably the spark devs are above doing optimizations for such lowly systems like Hive. (even though other frameworks like Apache Impala would also benefit from correct stats here - because Impala uses them correctly)
The only workaround is to follow up your Spark job with a corresponding hive job doing the compute statistics command, which is really ugly. :(
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With