
Exporting a Spark DataFrame to .csv with a header and a specific filename

I am trying to export data from a Spark DataFrame to a .csv file:

df.coalesce(1)\
  .write\
  .format("com.databricks.spark.csv")\
  .option("header", "true")\
  .save(output_path)

It creates a file named "part-r-00001-512872f2-9b51-46c5-b0ee-31d626063571.csv".

I want the filename to be "part-r-00000.csv" or "part-00000.csv"

As the file is being created on AWS S3, I am limited in how I can use os.system commands.

How can I set the file name while keeping the header in the file?

Thanks!

Naresh Y asked Feb 06 '18 21:02

People also ask

How do I save a Spark DataFrame as a CSV with header?

In Spark, you can save (write) a DataFrame to a CSV file on disk by using dataframeObj.write.csv("path"); using this you can also write the DataFrame to AWS S3, Azure Blob, HDFS, or any Spark-supported file system.

How do I write a DataFrame to a single CSV file in PySpark?

When you are ready to write a DataFrame, first use Spark's repartition() or coalesce() to merge data from all partitions into a single partition, then save it to a file. This still creates a directory and writes a single part file inside that directory instead of multiple part files.

What is difference between repartition and coalesce?

Coalesce uses existing partitions to minimize the amount of data that is shuffled, while repartition creates new partitions and does a full shuffle. Coalesce results in partitions holding different amounts of data (sometimes of widely varying sizes), whereas repartition produces roughly equal-sized partitions.


1 Answer

Well, even though my question got a −3 rating, here I'm posting the solution that helped me address the problem. Being a techie, I always care more about code and logic than grammar; at least for me, a small bit of context should be enough to understand the problem.

Coming to the solution:

When we create a .csv file from a Spark DataFrame, the output file is by default named part-x-yyyyy, where:

1) x is either 'm' or 'r', depending on whether the job was a map-only job or included a reduce phase
2) yyyyy is the mapper or reducer task number, which can be 00000 or a random number
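A small regular expression can recognize these part-file names before renaming them; this is a sketch of my own (the names PART_FILE and is_part_file are not from the original post), covering both the plain part-00000.csv form and the UUID-suffixed form that Spark appends:

```python
import re

# Matches e.g. "part-00000.csv" and
# "part-r-00001-512872f2-9b51-46c5-b0ee-31d626063571.csv":
# optional m/r marker, five-digit task number, optional 36-char UUID suffix.
PART_FILE = re.compile(r"^part-(?:[mr]-)?\d{5}(?:-[0-9a-f-]{36})?(?:\.csv)?$")

def is_part_file(name):
    """Return True if `name` looks like a Spark/Hadoop part file."""
    return bool(PART_FILE.match(name))
```

This is handy for skipping non-data entries such as the _SUCCESS marker when scanning the output directory.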

In order to rename the output file, running an HDFS move command via os.system should do.

import os

output_path_stage = ""  # set the source folder path here
output_path = ""        # set the target folder path here

# build the HDFS move command line
cmd2 = "hdfs dfs -mv " + output_path_stage + "part-*" + " " + output_path + "new_name.csv"
# execute the system command
os.system(cmd2)

FYI: if we use the rdd.saveAsTextFile option, the file is created without a header. If we use coalesce(1).write.format("com.databricks.spark.csv").option("header", "true").save(output_path), the file is created with a random part-x name. The solution above lets us create a .csv file with a header and delimiter, under the required file name.
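Since the question's output lands on AWS S3, where the hdfs shell may not be available, the same rename can be done through the AWS SDK. A hedged sketch — rename_part_file and target_key are names of my own, and it assumes boto3 with valid credentials and exactly one part file under the prefix. S3 has no true rename, so the pattern is copy-then-delete:

```python
def target_key(prefix, new_name):
    """Join an output prefix and the desired file name into one S3 key."""
    return prefix.rstrip("/") + "/" + new_name

def rename_part_file(bucket, prefix, new_name):
    """Copy the single Spark part object under `prefix` to `new_name`,
    then delete the original. Assumes boto3 and exactly one part file."""
    import boto3  # deferred so target_key() stays usable without AWS deps
    s3 = boto3.client("s3")
    listing = s3.list_objects_v2(Bucket=bucket, Prefix=prefix.rstrip("/") + "/part-")
    src_key = listing["Contents"][0]["Key"]
    dst_key = target_key(prefix, new_name)
    s3.copy_object(Bucket=bucket,
                   CopySource={"Bucket": bucket, "Key": src_key},
                   Key=dst_key)
    s3.delete_object(Bucket=bucket, Key=src_key)
```

For very large objects the copy is still server-side, so no data is downloaded, but objects over 5 GB would need a multipart copy instead of copy_object.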

Naresh Y answered Oct 18 '22 00:10