 

How to save csv files faster from pyspark dataframe?

I am currently using PySpark on a local Windows 10 system. The PySpark code itself runs quite fast, but saving the PySpark dataframe to CSV takes a long time.

I am converting the PySpark dataframe to pandas and then saving it to a CSV file. I have also tried using the write method to save the CSV file.

Full_data.toPandas().to_csv("Level 1 - {} Hourly Avg Data.csv".format(yr), index=False)




Full_data.repartition(1).write.format('com.databricks.spark.csv').option("header", "true").save("Level 1 - {} Hourly Avg Data.csv".format(yr))

Both approaches took about an hour to save the CSV files. Is there a faster way to save CSV files from a PySpark dataframe?

Chinmay Dalvi asked Aug 01 '19 14:08





1 Answer

In both examples you are reducing the level of parallelism.

The first example (toPandas) is computationally equivalent to calling collect(): the whole dataframe is gathered onto the driver, and the CSV is then written by pandas in a single thread.

The second example calls repartition(1), which reduces the level of parallelism to 1, so the write is again single-threaded.

Instead, try repartition(2) (or 4, or 8, depending on the number of execution threads available on your machine). That should be quicker because it leverages Spark's parallelism, although the result will be split into multiple files, one per partition.
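As a rough sketch of the suggestion above (not a definitive implementation): the snippet below writes the dataframe with several partitions using Spark's built-in CSV writer, then optionally stitches the resulting part files back into one CSV. The helper names write_csv_parallel, merge_part_files and default_partitions are hypothetical, introduced only for illustration. The sketch assumes Spark 2.0 or later (where write.csv is built in), that the output directory does not exist yet, and that each non-empty part file starts with the same header line.

```python
import glob
import os
import shutil


def default_partitions():
    """One partition per logical CPU; fall back to 2 if the count is unknown."""
    return os.cpu_count() or 2


def write_csv_parallel(df, out_dir, num_partitions=None):
    """Write a Spark DataFrame as several CSV part files inside out_dir.

    df is assumed to be a pyspark.sql.DataFrame. Spark creates out_dir and
    writes one part-*.csv file per partition, each with its own header.
    """
    n = num_partitions or default_partitions()
    df.repartition(n).write.option("header", "true").csv(out_dir)
    return n


def merge_part_files(out_dir, merged_path):
    """Concatenate part-*.csv files into one CSV, keeping a single header.

    Assumption: every non-empty part file begins with the same one-line header.
    """
    parts = sorted(glob.glob(os.path.join(out_dir, "part-*.csv")))
    header_written = False
    with open(merged_path, "w", newline="", encoding="utf-8") as merged:
        for part in parts:
            with open(part, "r", newline="", encoding="utf-8") as src:
                header = src.readline()
                if header and not header_written:
                    merged.write(header)
                    header_written = True
                # Copy the remaining rows unchanged.
                shutil.copyfileobj(src, merged)
    return len(parts)
```

With the dataframe from the question, this could be called as write_csv_parallel(Full_data, "Level 1 - {} Hourly Avg Data".format(yr), 4), followed by merge_part_files on the same directory if a single file is required. The merge is plain sequential file copying, so it adds some time back; skip it if downstream tools can read a directory of part files.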

Vzzarr answered Sep 22 '22 13:09
