I am trying to export data from a Spark DataFrame to a .csv file:
df.coalesce(1)\
.write\
.format("com.databricks.spark.csv")\
.option("header", "true")\
.save(output_path)
It is creating a file named "part-r-00001-512872f2-9b51-46c5-b0ee-31d626063571.csv"
I want the filename to be "part-r-00000.csv" or "part-00000.csv"
As the file is being created on AWS S3, I am limited in how I can use os.system commands.
How can I set the file name while keeping the header in the file?
Thanks!
In Spark, you can save (write) a DataFrame to a CSV file on disk using dataframeObj.write.csv("path"); this also lets you write the DataFrame to AWS S3, Azure Blob, HDFS, or any Spark-supported file system.
When you are ready to write a DataFrame, first use repartition() or coalesce() to merge the data from all partitions into a single partition, then save it to a file. This still creates a directory and writes a single part file inside that directory, instead of multiple part files.
coalesce() uses existing partitions to minimize the amount of data that is shuffled, while repartition() creates new partitions and does a full shuffle. As a result, coalesce() can produce partitions of very different sizes, whereas repartition() results in roughly equal-sized partitions.
Well, even though my question got a -3 rating, here I'm posting the solution that helped me address the problem. Being a techie, I tend to care more about code and logic than grammar; at least for me, a small amount of context should be enough to understand the problem.

Coming to the solution:
When we create a .csv file from a Spark DataFrame, the output file is by default named part-x-yyyyy, where:
1) x is either 'm' or 'r', depending on whether the job was map-only or involved a reduce
2) yyyyy is the mapper or reducer task number, which can be 00000 or a random number
In order to rename the output file, running an HDFS command via os.system should do:

import os

output_path_stage = "..."  # set the source folder path here (with trailing slash)
output_path = "..."        # set the target folder path here (with trailing slash)

# build the HDFS move command line
cmd2 = "hdfs dfs -mv " + output_path_stage + "part-*" + " " + output_path + "new_name.csv"

# execute the system command
os.system(cmd2)
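For a local-filesystem output path, the same rename can be done in pure Python without shelling out — a minimal sketch, assuming the output directory contains exactly one part-* file (the function name and paths are hypothetical):

```python
# Local-filesystem analogue of the HDFS rename above: locate the single
# part-* file Spark wrote into the output directory and move it to a
# fixed name.
import glob
import os
import shutil


def rename_part_file(output_dir, new_name):
    """Rename the lone part-* file in output_dir to new_name; return the new path."""
    parts = glob.glob(os.path.join(output_dir, "part-*"))
    if len(parts) != 1:
        raise RuntimeError("expected exactly one part file, found %d" % len(parts))
    target = os.path.join(output_dir, new_name)
    shutil.move(parts[0], target)
    return target
```

Note that for S3 there is no true rename; a tool like hdfs dfs -mv (with an s3a:// path) or an S3 copy-then-delete is needed instead.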
FYI, if we use the rdd.saveAsTextFile option, the file gets created with no header. If we use coalesce(1).write.format("com.databricks.spark.csv").option("header", "true").save(output_path), the file gets created with a random part-x name. The above solution helps us create a .csv file with a header and delimiter, along with the required file name.