How to keep partition columns when reading in ORC files in Spark

When reading in an ORC file in Spark, if you specify the partition column in the path, that column will not be included in the dataset. For example, if we have

val dfWithColumn = spark.read.orc("/some/path") 

val dfWithoutColumn = spark.read.orc("/some/path/region_partition=1")

then dfWithColumn will have a region_partition column, but dfWithoutColumn will not. How can I specify that I want to include all columns, even if they're partitioned?

I am using Spark 2.2 with Scala.

EDIT: This is a reusable Spark program that takes its arguments from the command line; I want the program to work even if the user passes in a specific partition of a table instead of the whole table, so using Dataset.filter is not an option.
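For concreteness, the entry point looks roughly like this (the object and argument names are placeholders; the only thing that matters is that the path argument may point at the whole table or at a single partition directory):

import org.apache.spark.sql.SparkSession

object OrcLoader {
  def main(args: Array[String]): Unit = {
    val inputPath = args(0) // e.g. "/some/path" or "/some/path/region_partition=1"
    val spark = SparkSession.builder().getOrCreate()

    // Reading a partition directory directly drops region_partition from the
    // schema, so a later Dataset.filter on that column cannot work -- the
    // column simply is not there.
    val df = spark.read.orc(inputPath)
    df.printSchema()
  }
}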

asked Oct 17 '22 by alexgbelov

1 Answer

It works the same way as for Parquet: pass the basePath option so that Spark's partition discovery keeps the partition column.

Reference: https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#partition-discovery

val df = spark.read
  .option("basePath", "file://foo/bar/")
  .orc("file://foo/bar/partition_column=XXX")

df now has a 'partition_column' column.
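Applied to the example from the question (the paths and partition value come from the question; the variable names are illustrative), the same basePath approach on Spark 2.2 / Scala would look roughly like this:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// basePath points at the table root; the read path is the partition directory.
// Partition discovery then adds region_partition back into the schema.
val dfPartition = spark.read
  .option("basePath", "/some/path")
  .orc("/some/path/region_partition=1")

dfPartition.printSchema() // includes region_partition

Because basePath tells Spark where the table root is, it infers partition columns from the directory names between the base path and the data files, so the same code works whether the caller passes the whole table or a single partition.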

answered Oct 21 '22 by TaeKyung Yoo