I am currently using PySpark on a local Windows 10 system. The PySpark code runs quite fast, but it takes a long time to save the PySpark DataFrame to a CSV file.
I am converting the PySpark DataFrame to pandas and then saving it to a CSV file:

```python
Full_data.toPandas().to_csv("Level 1 - {} Hourly Avg Data.csv".format(yr), index=False)
```

I have also tried using the write method to save the CSV file:

```python
Full_data.repartition(1).write.format('com.databricks.spark.csv').option("header", "true").save("Level 1 - {} Hourly Avg Data.csv".format(yr))
```
Both approaches took about an hour to save the CSV file. Is there a faster way to save a CSV file from a PySpark DataFrame?
In both of the reported examples you are reducing the level of parallelism.

In the first example, toPandas() is computationally equivalent to calling collect(): the whole DataFrame is gathered into the driver, which makes the write single-threaded.

In the second example, repartition(1) reduces the level of parallelism to 1, which again makes the write single-threaded.

Try repartition(2) instead (or 4, or 8, and so on, according to the number of execution threads available on your machine). That should produce results faster by leveraging Spark's parallelism, even though it will split the output into multiple files, one per partition.
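For instance, here is a minimal sketch of that suggestion. The partition count of 4 and the output path are assumptions to adapt to your setup; note also that since Spark 2.0 the CSV writer is built in, so format('com.databricks.spark.csv') can be dropped in favor of csv():

```python
# Minimal sketch: write the CSV in parallel with Spark's built-in writer.
# The partition count (4) is an assumption; match it to the number of
# execution threads on your machine.
(Full_data
    .repartition(4)
    .write
    .option("header", "true")
    .mode("overwrite")
    .csv("Level 1 - {} Hourly Avg Data".format(yr)))
```

The path here names a directory: Spark writes one part-*.csv file per partition inside it, and you can concatenate those files afterwards if a single CSV is required.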