What is the option to enable ORC indexing from Spark?
df
.write()
.option("mode", "DROPMALFORMED")
.option("compression", "snappy")
.mode("overwrite")
.format("orc")
.option("index", "user_id")
.save(...);
I'm making up .option("index", "user_id"); what would I actually have to put there to index the column "user_id" in ORC?
For existing Hive tables, Spark can read them directly without createOrReplaceTempView. If the table is stored in ORC format (the default), predicate push-down, partition pruning, and vectorized query execution are also applied according to the configuration.
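As an illustration, a rough sketch of querying such a table directly (the database/table name is made up, and Hive support must be enabled on the session):
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-orc-example")   # hypothetical app name
    .enableHiveSupport()           # needed so Spark can see the Hive metastore
    .getOrCreate()
)

# The Hive table is queried directly, no createOrReplaceTempView needed;
# push-down and partition pruning apply per the ORC-related configs.
spark.sql("SELECT * FROM mydb.events_orc WHERE user_id = 'someUser'").show()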
Spark supports two ORC implementations (native and hive), which are controlled by spark.sql.orc.impl.
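For instance, a minimal sketch of choosing the implementation at session-creation time (the app name is a placeholder, and Spark 2.3+ is assumed so that both implementations are available):
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("orc-impl-example")              # hypothetical app name
    .config("spark.sql.orc.impl", "native")   # "native" = Spark's built-in, vectorized ORC reader
    .getOrCreate()                            # "hive" = the ORC library bundled with Hive
)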
There is a desktop application for viewing Parquet as well as other binary-format data such as ORC and AVRO. It's a pure Java application, so it can run on Linux, Mac, and Windows. Please check Bigdata File Viewer for details. It supports complex data types like array, map, struct, etc.
Have you tried .partitionBy("user_id")?
df
.write()
.option("mode", "DROPMALFORMED")
.option("compression", "snappy")
.mode("overwrite")
.format("orc")
.partitionBy("user_id")
.save(...)
According to the original blog post on bringing ORC support to Apache Spark, there is a configuration knob to turn on in your Spark context to enable the use of ORC indexes.
# enable filters in ORC
sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
Reference: https://databricks.com/blog/2015/07/16/joint-blog-post-bringing-orc-support-into-apache-spark.html
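For newer Spark versions that expose a SparkSession instead of a SQLContext, an equivalent sketch (the path, column name, and value are placeholders) could look like this; the pushed-down filter lets the ORC reader skip stripes using their min/max index statistics:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("orc-pushdown-example").getOrCreate()  # hypothetical app name

# Equivalent of sqlContext.setConf on the newer API
spark.conf.set("spark.sql.orc.filterPushdown", "true")

# "/path/to/orc" is a placeholder; point it at the real dataset location
df = spark.read.format("orc").load("/path/to/orc")

# The equality predicate is pushed to the ORC reader, which can skip whole
# stripes/row groups whose index statistics rule out the value
df.filter(col("user_id") == "someUser").show()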