I am using Spark to write out data into partitions. Given a dataset with two columns (foo, bar), if I do

df.write.mode("overwrite").format("csv").partitionBy("foo").save("/tmp/output")

I get an output of

/tmp/output/foo=1/X.csv
/tmp/output/foo=2/Y.csv
...

However, the output CSV files only contain the value for bar, not foo. I know the value of foo is already captured in the directory name foo=N, but is it possible to also include the value of foo in the CSV file?
Spark uses the partitioner property to determine on which worker a particular record of an RDD should be stored. When Spark reads a file from HDFS, it creates a single partition for each input split; the input split is determined by the Hadoop InputFormat used to read the file.
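For example, you can inspect how many partitions Spark created when reading a file. A minimal sketch; the input path and session setup here are assumptions, not from the question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# One RDD partition per Hadoop input split by default.
rdd = spark.sparkContext.textFile("/tmp/input/data.csv")  # hypothetical path
print(rdd.getNumPartitions())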
This can be controlled by changing the Spark partition size and the number of partitions, using the repartition() method: repartition() shuffles the data and divides it into the requested number of partitions. A better way, though, is to partition at the data source and save the network traffic.
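A sketch of the first approach, assuming the df from the question:

# Full shuffle into a fixed number of partitions.
df8 = df.repartition(8)
print(df8.rdd.getNumPartitions())  # 8

# Repartitioning by a column instead co-locates rows with the same key,
# which pairs well with a later partitionBy() on write.
df_by_foo = df.repartition("foo")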
Spark/PySpark partitioning is a way to split the data into multiple partitions so that transformations can execute on them in parallel, completing the job faster. You can also write partitioned data into a file system (multiple sub-directories) for faster reads by downstream systems.
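A downstream read can then prune partitions. A sketch, assuming the /tmp/output layout from the question: filtering on the partition column lets Spark scan only the matching foo=N directories.

from pyspark.sql.functions import col

read_back = spark.read.format("csv").load("/tmp/output")
# Spark infers foo from the directory names, so this filter prunes
# the scan down to /tmp/output/foo=1/.
only_foo_1 = read_back.where(col("foo") == 1)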
Clusters will not be fully utilized unless you set the level of parallelism of each operation high enough. The general recommendation for Spark is to have about 4x as many partitions as there are cores available to the application; as an upper bound on the partition count, each task should take at least 100 ms to execute.
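As a rough illustration of that rule of thumb (a sketch, not a tuned setting; I am assuming defaultParallelism reflects the cores granted to the application, which is typically but not always the case):

# Hypothetical sizing based on the 4x-cores guideline.
cores = spark.sparkContext.defaultParallelism
df_sized = df.repartition(cores * 4)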
Only if you make a copy of the column under a different name:

from pyspark.sql.functions import col

(df
 .withColumn("foo_", col("foo"))  # duplicate foo so it survives the write
 .write.mode("overwrite")
 .format("csv")
 .partitionBy("foo_")             # partition on the copy; foo stays in the CSV rows
 .save("/tmp/output"))