What is the option to enable ORC indexing from Spark?
df
.write()
.option("mode", "DROPMALFORMED")
.option("compression", "snappy")
.mode("overwrite")
.format("orc")
.option("index", "user_id")
.save(...);
I'm making up .option("index", "user_id"); what would I actually have to put there to index the column "user_id" in ORC?
For existing Hive tables, Spark can read them directly without createOrReplaceTempView. If the table is stored in ORC format (the default), predicate push-down, partition pruning, and vectorized query execution are also applied according to the configuration.
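As an illustration, a rough sketch of querying such a table directly (the database/table name is made up, and Hive support must be enabled on the session):
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-orc-example")   # hypothetical app name
    .enableHiveSupport()           # needed so Spark can see the Hive metastore
    .getOrCreate()
)

# The Hive table is queried directly, no createOrReplaceTempView needed;
# push-down and partition pruning apply per the ORC-related configs.
spark.sql("SELECT * FROM mydb.events_orc WHERE user_id = 'someUser'").show()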
Spark supports two ORC implementations (native and hive), which are controlled by spark.sql.orc.impl.
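For instance, a minimal sketch of choosing the implementation at session-creation time (the app name is a placeholder, and Spark 2.3+ is assumed so that both implementations are available):
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("orc-impl-example")              # hypothetical app name
    .config("spark.sql.orc.impl", "native")   # "native" = Spark's built-in, vectorized ORC reader
    .getOrCreate()                            # "hive" = the ORC library bundled with Hive
)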
There is a desktop application for viewing Parquet as well as other binary-format data such as ORC and AVRO. It's a pure Java application, so it can run on Linux, Mac, and Windows. Please check Bigdata File Viewer for details. It supports complex data types like array, map, struct, etc.
Have you tried .partitionBy("user_id")?
df
.write()
.option("mode", "DROPMALFORMED")
.option("compression", "snappy")
.mode("overwrite")
.format("orc")
.partitionBy("user_id")
.save(...)
According to the original blog post on bringing ORC support to Apache Spark, there is a configuration knob to turn on in your Spark context to enable the use of ORC indexes.
# enable filters in ORC
sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
Reference: https://databricks.com/blog/2015/07/16/joint-blog-post-bringing-orc-support-into-apache-spark.html
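For newer Spark versions that expose a SparkSession instead of a SQLContext, an equivalent sketch (the path, column name, and value are placeholders) could look like this; the pushed-down filter lets the ORC reader skip stripes using their min/max index statistics:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("orc-pushdown-example").getOrCreate()  # hypothetical app name

# Equivalent of sqlContext.setConf on the newer API
spark.conf.set("spark.sql.orc.filterPushdown", "true")

# "/path/to/orc" is a placeholder; point it at the real dataset location
df = spark.read.format("orc").load("/path/to/orc")

# The equality predicate is pushed to the ORC reader, which can skip whole
# stripes/row groups whose index statistics rule out the value
df.filter(col("user_id") == "someUser").show()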