Writing a Spark dataframe to a .csv file in S3 and choosing a file name in PySpark

I have a dataframe and I am going to write it to a .csv file in S3. I use the following code:

df.coalesce(1).write.csv("dbfs:/mnt/mount1/2016//product_profit_weekly", mode='overwrite', header=True)

It puts a .csv file in the product_profit_weekly folder, but at the moment the .csv file has a weird name in S3. Is it possible to choose a file name when I write it?

asked Oct 28 '16 by chessosapiens


1 Answer

Spark dataframe writers (df.write.___) don't write to a single file; they write one chunk per partition. I imagine what you get is a directory called

dbfs:/mnt/mount1/2016//product_profit_weekly

and one file inside called

part-00000
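
If you really do need a single file with a specific name, a common workaround is to rename that part file after the write. A minimal sketch, assuming a Databricks environment where dbutils is available; the target file name is illustrative:

out_dir = "dbfs:/mnt/mount1/2016//product_profit_weekly"
df.coalesce(1).write.csv(out_dir, mode='overwrite', header=True)

# Locate the lone part file and copy it to the name you want.
part_file = [f.path for f in dbutils.fs.ls(out_dir) if f.name.startswith("part-")][0]
dbutils.fs.cp(part_file, out_dir + "/product_profit_weekly.csv")
dbutils.fs.rm(part_file)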

In this case, you are doing something quite inefficient and not very "sparky": you are coalescing all dataframe partitions into one, meaning your task isn't actually executed in parallel!

Here's a different model. To take advantage of all of Spark's parallelization, DON'T coalesce; write in parallel to some directory.
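
For example, the parallel write is simply the same call without the coalesce (a sketch reusing the asker's path):

# Each partition is written concurrently as its own part file
# inside the target directory.
df.write.csv("dbfs:/mnt/mount1/2016//product_profit_weekly",
             mode='overwrite', header=True)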

If you have 100 partitions, you will get:

part-00000
part-00001
...
part-00099

If you need everything in one flat file, write a little function to merge it after the fact. You could do this either in Scala or in bash with:

cat ${dir}/part-* > $flatFilePath
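
One caveat: with header=True every part file carries its own header row, so a blind cat repeats the header. Below is a hedged Python equivalent that keeps only the first header, assuming the part files are reachable through the local /dbfs FUSE mount (paths illustrative):

import glob

src_dir = "/dbfs/mnt/mount1/2016/product_profit_weekly"    # local view of the dbfs:/ path
flat_path = "/dbfs/mnt/mount1/2016/product_profit_weekly.csv"

with open(flat_path, "w") as flat:
    for i, part in enumerate(sorted(glob.glob(src_dir + "/part-*"))):
        with open(part) as f:
            lines = f.readlines()
            # Keep the header row only from the first part file.
            flat.writelines(lines if i == 0 else lines[1:])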
answered Oct 26 '22 by Tim