
Is it possible to reduce the number of MetaStore checks when querying a Hive table with lots of columns?

I am using Spark SQL on Databricks, which uses a Hive metastore, and I am trying to set up a job/query that uses quite a few columns (20+).

The time it takes to run the metastore validation checks scales linearly with the number of columns included in my query. Is there any way to skip this step, or pre-compute the checks, or at least make the metastore check once per table rather than once per column?

A small example: when I run the code below, even before calling display or collect, the metastore check happens once:

new_table = table.withColumn("new_col1", F.col("col1"))

and when I run the code below, the metastore check happens multiple times, and therefore takes longer:

new_table = (table
    .withColumn("new_col1", F.col("col1"))
    .withColumn("new_col2", F.col("col2"))
    .withColumn("new_col3", F.col("col3"))
    .withColumn("new_col4", F.col("col4"))
    .withColumn("new_col5", F.col("col5"))
)

The metastore checks it performs look like this on the driver node:

20/01/09 11:29:24 INFO HiveMetaStore: 6: get_database: xxx
20/01/09 11:29:24 INFO audit: ugi=root    ip=unknown-ip-addr    cmd=get_database: xxx

The view presented to the user on Databricks is:

Performing Hive catalog operation: databaseExists
Performing Hive catalog operation: tableExists
Performing Hive catalog operation: getRawTable
Running command...

I would be interested to know if anyone can confirm that this is just the way it works (a metastore check per column), and whether I simply have to plan for the overhead of the metastore checks.

Asked by Louise Fallon
1 Answer

I am surprised by this behavior, as it does not fit the Spark processing model, and I cannot replicate it in Scala. It is possible that it is somehow specific to PySpark, but I doubt that, since PySpark is just an API for creating Spark plans.

What is happening, however, is that after every withColumn(...) call the plan is analyzed again. If the plan is large, this can take a while. There is a simple optimization: replace multiple withColumn(...) calls for independent columns with a single df.select(F.col("*"), F.col("col2").alias("new_col2"), ...). That way only one analysis is performed, as in the sketch below.
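As a minimal sketch of the difference (the column names are just the illustrative ones from the question, and table is assumed to be an existing DataFrame):

from pyspark.sql import functions as F

# Chained withColumn: every call produces a new DataFrame whose plan is
# analyzed again, so the analysis overhead grows with the number of calls.
slow = (table
    .withColumn("new_col1", F.col("col1"))
    .withColumn("new_col2", F.col("col2"))
    .withColumn("new_col3", F.col("col3"))
)

# Single select: all the new columns are added in one projection,
# so the plan is only analyzed once.
fast = table.select(
    F.col("*"),
    F.col("col1").alias("new_col1"),
    F.col("col2").alias("new_col2"),
    F.col("col3").alias("new_col3"),
)

Both produce the same columns; the select version just builds the plan in one step instead of one step per added column.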

In some cases of extremely large plans, we've saved 10+ minutes of analysis for a single notebook cell.

Answered by Sim


