Invalidate metadata / refresh Impala from Spark code

I'm working on an NRT (near-real-time) solution that requires me to frequently update the metadata on an Impala table.

Currently this invalidation is done after my spark code has run. I would like to speed things up by doing this refresh/invalidate directly from my Spark code.

What would be the most efficient approach?

  • Oozie is just too slow (30 sec overhead? no thanks)
  • An SSH action to an (edge) node seems like a valid solution but feels "hackish"
  • I don't see a way to do this from the HiveContext in Spark either.
Havnar asked Jul 06 '16


1 Answer

The REFRESH and INVALIDATE METADATA commands are specific to Impala, and you must be connected to an Impala daemon to run them; they trigger a refresh of the Impala-specific metadata cache. In your case you probably just need a REFRESH of the list of files in each partition, not a wholesale INVALIDATE METADATA that rebuilds the list of all partitions and all their files from scratch.
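
To make the difference concrete, here are the two statements side by side as Scala strings (paste-able into spark-shell; somedb.sometable is the placeholder table used in the steps below):

    // Impala-specific statements, to be sent over a JDBC session (sketched after the list below)
    val refresh    = "REFRESH somedb.sometable"              // cheap: re-scan the file list of known partitions
    val invalidate = "INVALIDATE METADATA somedb.sometable"  // expensive: rebuild all partition and file metadata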

You could use the Spark SQLContext to connect to Impala via JDBC and read data -- but not run arbitrary commands. Damn. So you are back to the basics:

  • download the latest Cloudera JDBC driver for Impala
  • install it on the server where you run your Spark job
  • list all its JARs in your spark.driver.extraClassPath / spark.executor.extraClassPath properties
  • develop some Scala code to open a JDBC session against an Impala daemon and run arbitrary commands (such as REFRESH somedb.sometable) -- the hard way; a minimal sketch follows this list
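
A minimal sketch of that last step, assuming the Cloudera JDBC 4.1 driver (class com.cloudera.impala.jdbc41.Driver), a hypothetical impalad-host reachable on the default Impala JDBC port 21050, and an unsecured cluster (AuthMech=0) -- adjust the driver class, URL and authentication for your driver version and cluster security:

    import java.sql.DriverManager

    object ImpalaRefresh {
      def main(args: Array[String]): Unit = {
        // Register the Cloudera Impala JDBC driver -- its JARs must be on the classpath
        Class.forName("com.cloudera.impala.jdbc41.Driver")
        // Open a session directly against an Impala daemon
        val conn = DriverManager.getConnection(
          "jdbc:impala://impalad-host:21050/;AuthMech=0")
        try {
          val stmt = conn.createStatement()
          try {
            // REFRESH is enough when only the data files changed;
            // run "INVALIDATE METADATA somedb.sometable" instead after structural changes
            stmt.execute("REFRESH somedb.sometable")
          } finally stmt.close()
        } finally conn.close()
      }
    }

If you call something like this from the driver at the end of your Spark job, only spark.driver.extraClassPath needs the Impala JDBC JARs, and you skip the Oozie/SSH detour entirely.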

Hopefully Google will find some more examples of JDBC/Scala code such as this one.

Samson Scharfrichter answered Nov 16 '22