How to keep partition columns when reading in ORC files in Spark

When reading in an ORC file in Spark, if you specify the partition column in the path, that column will not be included in the dataset. For example, if we have

val dfWithColumn = spark.read.orc("/some/path") 

val dfWithoutColumn = spark.read.orc("/some/path/region_partition=1")

then dfWithColumn will have a region_partition column, but dfWithoutColumn will not. How can I specify that I want to include all columns, even if they're partitioned?

I am using Spark 2.2 with Scala.

EDIT: This is a reusable Spark program that takes its arguments from the command line; I want the program to work even if the user passes in a specific partition of a table instead of the whole table, so using Dataset.filter is not an option.
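For concreteness, the entry point looks roughly like this (the object and argument names are placeholders; the only thing that matters is that the path argument may point at the whole table or at a single partition directory):

import org.apache.spark.sql.SparkSession

object OrcLoader {
  def main(args: Array[String]): Unit = {
    val inputPath = args(0) // e.g. "/some/path" or "/some/path/region_partition=1"
    val spark = SparkSession.builder().getOrCreate()

    // Reading a partition directory directly drops region_partition from the
    // schema, so a later Dataset.filter on that column cannot work -- the
    // column simply is not there.
    val df = spark.read.orc(inputPath)
    df.printSchema()
  }
}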

asked Oct 17 '22 by alexgbelov

1 Answer

It works the same way as for Parquet: pass the basePath option so that Spark's partition discovery keeps the partition column.

Reference: https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#partition-discovery

val df = spark.read
  .option("basePath", "file://foo/bar/")
  .orc("file://foo/bar/partition_column=XXX")

df now has a 'partition_column' column.
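Applied to the example from the question (the paths and partition value come from the question; the variable names are illustrative), the same basePath approach on Spark 2.2 / Scala would look roughly like this:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// basePath points at the table root; the read path is the partition directory.
// Partition discovery then adds region_partition back into the schema.
val dfPartition = spark.read
  .option("basePath", "/some/path")
  .orc("/some/path/region_partition=1")

dfPartition.printSchema() // includes region_partition

Because basePath tells Spark where the table root is, it infers partition columns from the directory names between the base path and the data files, so the same code works whether the caller passes the whole table or a single partition.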

answered Oct 21 '22 by TaeKyung Yoo