When reading an ORC file in Spark, if you specify the partition column in the path, that column will not be included in the resulting dataset. For example, if we have
val dfWithColumn = spark.read.orc("/some/path")
val dfWithoutColumn = spark.read.orc("/some/path/region_partition=1")
then dfWithColumn will have a region_partition column, but dfWithoutColumn will not. How can I specify that I want to include all columns, even if they're partitioned?
I am using Spark 2.2 with Scala.
EDIT: This is a reusable Spark program that takes its arguments from the command line; I want it to work even if the user passes in a specific partition of a table instead of the whole table. So, using Dataset.filter is not an option.
It works the same way as with Parquet: set the basePath option so that partition discovery still picks up the partition column.
Reference: https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#partition-discovery
val df = spark.read.option("basePath", "file://foo/bar/")
  .orc("file://foo/bar/partition_column=XXX")
df has a 'partition_column' column.
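Since the program has to accept either the table root or a single partition directory as its command-line argument (per the edit in the question), one possible approach is to derive the base path by dropping trailing key=value segments from whatever path the user supplies and always pass it as basePath. This is only a sketch; the helper name stripPartitionSegments and the object name ReadWithPartitions are hypothetical, not part of any Spark API.

import org.apache.spark.sql.{DataFrame, SparkSession}

object ReadWithPartitions {
  // Hypothetical helper: drop trailing "key=value" directory segments so the
  // remaining prefix can be used as the basePath for partition discovery.
  def stripPartitionSegments(path: String): String = {
    val segments = path.stripSuffix("/").split("/")
    val base = segments.reverse.dropWhile(_.matches("[^=/]+=[^=/]+")).reverse
    base.mkString("/") + "/"
  }

  def readOrc(spark: SparkSession, inputPath: String): DataFrame = {
    spark.read
      .option("basePath", stripPartitionSegments(inputPath))
      .orc(inputPath)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ReadWithPartitions").getOrCreate()
    // Works whether args(0) is "/some/path" or "/some/path/region_partition=1";
    // in both cases region_partition shows up as a column.
    val df = readOrc(spark, args(0))
    df.printSchema()
  }
}

With this, passing either the whole table or a single partition directory should yield a DataFrame that includes region_partition.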