I am partitioning a DataFrame as follows:
df.write.partitionBy("type", "category").parquet(config.outpath)
The code gives the expected results (i.e. data partitioned by type & category). However, the "type" and "category" columns are removed from the data / schema. Is there a way to prevent this behaviour?
A write that dynamically overwrites partitions removes all existing data in each logical partition for which the write will commit new data. Any existing logical partition for which the write does not contain data will remain unchanged.
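For example, a minimal sketch of enabling dynamic partition overwrite (available from Spark 2.3 onward), reusing the df and config.outpath names from the question; the config key and behaviour are standard Spark, but whether this fits your pipeline is an assumption:

spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

df.write
  .mode("overwrite")
  .partitionBy("type", "category")
  .parquet(config.outpath)
// Only the (type, category) partitions for which df holds rows are replaced;
// any other partitions already under config.outpath are left untouched.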
Spark's coalesce() is used only to reduce the number of partitions. It is an optimized variant of repartition() in which data movement across partitions is kept to a minimum.
repartition() can be used to either increase or decrease the number of partitions of a Spark DataFrame. However, repartition() involves a full shuffle, which is a costly operation.
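For example (a minimal sketch, assuming df is any existing DataFrame):

val merged = df.coalesce(2)        // only reduces the partition count; avoids a full shuffle
val rebalanced = df.repartition(8) // can increase or decrease the count; triggers a full shuffle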
A PySpark partition is a way to split a large dataset into smaller datasets based on one or more partition keys. You can also partition on multiple columns using partitionBy(); just pass the columns you want to partition by as arguments to this method. Syntax: partitionBy(self, *cols)
I can think of one workaround, which is rather lame, but works.
import spark.implicits._

val duplicated = df
  .withColumn("_type", $"type")
  .withColumn("_category", $"category")

duplicated.write.partitionBy("_type", "_category").parquet(config.outpath)
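Reading the output back should then give you both sets of columns: type and category from the Parquet files themselves, and _type and _category inferred from the directory names. A quick check (a sketch, assuming the same config.outpath as above):

val readBack = spark.read.parquet(config.outpath)
readBack.printSchema() // expected to contain type, category, _type and _category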
I'm answering this in the hope that someone has a better answer or explanation than I do (or that the OP has since found a better solution), since I ran into the same question.
In general, Ivan's answer is a fine kludge. BUT...
If you are strictly reading and writing in Spark, you can simply use the basePath option when reading your data.
https://spark.apache.org/docs/2.2.0/sql-programming-guide.html#partition-discovery
By passing path/to/table to either SparkSession.read.parquet or SparkSession.read.load, Spark SQL will automatically extract the partitioning information from the paths.
Example:
val dataset = spark
  .read
  .format("parquet")
  .option("basePath", hdfsInputBasePath)
  .load(hdfsInputPath)
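With basePath set, partition discovery keeps the partition columns in the schema even when you load only a subdirectory. A quick check (a sketch, assuming hdfsInputPath points at a subdirectory such as a single type=.../category=... partition under hdfsInputBasePath):

dataset.printSchema() // type and category are recovered from the directory names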