I'm working on an NRT (near-real-time) solution that requires me to frequently update the metadata on an Impala table.
Currently this invalidation is done after my Spark code has run. I would like to speed things up by doing this refresh/invalidate directly from my Spark code.
What would be the most efficient approach?
The `REFRESH` and `INVALIDATE METADATA` commands are specific to Impala. You must be connected to an Impala daemon to be able to run these -- they trigger a refresh of the Impala-specific metadata cache (in your case you probably just need a `REFRESH` of the list of files in each partition, not a wholesale `INVALIDATE` to rebuild the list of all partitions and all their files from scratch).
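To make the distinction concrete, here is a purely illustrative Scala helper (the function name and the Boolean flag are mine, not part of any Impala or Spark API) that picks the right statement:

```scala
// Purely illustrative helper: pick the cheap REFRESH when only new data
// files landed in partitions Impala already knows about; fall back to the
// heavyweight INVALIDATE METADATA when tables or partitions were
// created/dropped outside of Impala.
def impalaCacheCommand(table: String, structureChanged: Boolean): String =
  if (structureChanged) s"INVALIDATE METADATA $table" // rebuild partition + file lists from scratch
  else s"REFRESH $table"                              // just re-list the data files per partition
```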
You could use the Spark `SqlContext` to connect to Impala via JDBC and read data -- but not run arbitrary commands.
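For instance, a read-only lookup might look like the sketch below (assuming `sqlContext` is your existing `org.apache.spark.sql.SQLContext`, the Impala JDBC driver is already on the classpath, and the host, port, and table names are placeholders for your cluster):

```scala
// Read-only access: Spark wraps the table in a SELECT under the hood,
// so you can query it, but there is no way to push a REFRESH or
// INVALIDATE METADATA statement through this API.
val df = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:impala://impala-daemon-host:21050")   // placeholder host; 21050 is Impala's default JDBC port
  .option("driver", "com.cloudera.impala.jdbc41.Driver")     // Cloudera Impala JDBC driver class
  .option("dbtable", "somedb.sometable")
  .load()
df.show()
```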
Damn. So you are back to the basics:

- download the latest Impala JDBC driver from Cloudera
- install it on the server where you run your Spark job
- list all the JARs in your `*.*.extraClassPath` properties
- develop some Scala code to open a JDBC session against an Impala daemon and run arbitrary commands (such as `REFRESH somedb.sometable`) -- the hard way; a sketch follows below

Hopefully Google will find some examples of JDBC/Scala code such as this one.
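To give you a head start, here is a minimal sketch of that "hard way" using plain JDBC from Scala (the host, port, and table are placeholders, and your cluster may need extra connection options, e.g. for Kerberos -- check the driver documentation):

```scala
import java.sql.DriverManager

object ImpalaRefresh {
  def main(args: Array[String]): Unit = {
    // Placeholder connection string: point it at any impalad; 21050 is the
    // default JDBC (HiveServer2-protocol) port for Impala.
    val url = "jdbc:impala://impala-daemon-host:21050"

    Class.forName("com.cloudera.impala.jdbc41.Driver") // Cloudera Impala JDBC driver

    val conn = DriverManager.getConnection(url)
    try {
      val stmt = conn.createStatement()
      try {
        // The cheap option: re-scan the data files of a table Impala already
        // knows about. Run this right after your Spark job finishes writing.
        stmt.execute("REFRESH somedb.sometable")
      } finally {
        stmt.close()
      }
    } finally {
      conn.close()
    }
  }
}
```

You could invoke this from your Spark driver code right after the write completes, so the Impala cache update happens in the same job instead of as a separate step.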