
Using partitionBy on a DataFrameWriter writes a directory layout with column names, not just values

I am using Spark 2.0.

I have a DataFrame. My code looks something like the following:

df.write.partitionBy("year", "month", "day").format("csv").option("header", "true").save(s"s3://bucket/")

And when the program executes, it writes files in the following format:

s3://bucket/year=2016/month=11/day=15/file.csv

How do I configure the format to be like this:

s3://bucket/2016/11/15/file.csv

I would also like to know if it is possible to configure the filename.

Here is the relevant documentation, which seems pretty sparse:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameWriter

partitionBy(colNames: String*): DataFrameWriter[T]
Partitions the output by the given columns on the file system. If specified, the output is laid out on the file system similar to Hive's partitioning scheme. As an example, when we partition a dataset by year and then month, the directory layout would look like:

year=2016/month=01/
year=2016/month=02/
Partitioning is one of the most widely used techniques to optimize physical data layout. It provides a coarse-grained index for skipping unnecessary data reads when queries have predicates on the partitioned columns. In order for partitioning to work well, the number of distinct values in each column should typically be less than tens of thousands.

This was initially applicable for Parquet but in 1.5+ covers JSON, text, ORC and avro as well.
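
To see what this layout buys you, here is a minimal Scala sketch of reading the partitioned output back (an illustration only: it assumes a SparkSession named spark, and the paths and columns follow the question):

import spark.implicits._

// Spark discovers year/month/day from the directory names and exposes them
// as columns of the DataFrame, even though the CSV files themselves do not
// contain them.
val logs = spark.read
  .option("header", "true")
  .csv("s3://bucket/")

// Because the predicate is on partition columns, Spark prunes the scan to
// s3://bucket/year=2016/month=11/day=15/ instead of reading everything.
logs.filter($"year" === 2016 && $"month" === 11 && $"day" === 15).show()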
asked Nov 15 '16 by satoukum

People also ask

How does partitionBy work in Spark?

PySpark's partitionBy() partitions the output by column values when writing a DataFrame to disk or another file system. When you write a DataFrame with partitionBy(), PySpark splits the records on the partition column values and stores each partition's data in its own sub-directory.

How do I partition data in PySpark?

PySpark supports partitioning in two ways: in memory (the DataFrame) and on disk (the file system). To partition in memory, you can repartition the DataFrame by calling the repartition() or coalesce() transformations; a sketch of both kinds follows below.
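
Here is a minimal Scala sketch of both, matching the question's code (df and the bucket path are placeholders):

// In-memory partitioning: controls parallelism and the number of output
// files written per partition directory; coalesce(n) merges partitions
// without a full shuffle.
val reshaped = df.repartition(4)

// On-disk partitioning: controls the directory layout of the output.
// Each distinct (year, month, day) combination gets its own sub-directory.
reshaped.write
  .partitionBy("year", "month", "day")
  .format("csv")
  .option("header", "true")
  .save("s3://bucket/")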


2 Answers

This is expected and desired behavior. Spark uses the directory structure for partition discovery and pruning, and the correct structure, including the column names, is necessary for it to work.

You also have to remember that partitioning drops the partitioning columns from the data files themselves; their values are encoded only in the directory names.

If you need a specific directory structure, you should use a downstream process to rename the directories.
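
As one possible downstream step, here is a sketch in Scala using the Hadoop FileSystem API (an illustration under assumptions: it presumes a SparkSession named spark and a Hadoop-compatible file system, and stripColumnPrefix, basePath, and colName are hypothetical names). The second answer below does the same thing as a shell script.

import org.apache.hadoop.fs.Path

// Strip the "col=" prefix from each partition directory,
// e.g. year=2016 -> 2016.
def stripColumnPrefix(basePath: String, colName: String): Unit = {
  val base = new Path(basePath)
  val fs = base.getFileSystem(spark.sparkContext.hadoopConfiguration)
  fs.listStatus(base)
    .filter(s => s.isDirectory && s.getPath.getName.startsWith(colName + "="))
    .foreach { s =>
      val target = new Path(s.getPath.getParent, s.getPath.getName.stripPrefix(colName + "="))
      fs.rename(s.getPath, target)
    }
}

// Flatten one level of the layout: year=2016 -> 2016.
stripColumnPrefix("s3://bucket", "year")

Keep in mind that on S3 a rename is implemented as copy-and-delete, so this can be slow for large outputs, and that for the nested month and day levels you would call the helper again on each sub-directory.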

answered by user7723061


You can use the following script to rename the directories by stripping the column-name prefix:

#!/usr/bin/env bash

# Rename partition directories by deleting the "COLUMN=" prefix,
# e.g. DATE=20170708 becomes 20170708.
# Usage: rename_partitions.sh <hdfs-path> <column-name>

path=$1
col=$2
for f in $(hdfs dfs -ls "$path" | awk '{print $NF}' | grep "$col="); do
    a="$(echo "$f" | sed "s/$col=//")"
    hdfs dfs -mv "$f" "$a"
done
answered by Duong Nguyen