I am currently using PySpark on a local Windows 10 system. The PySpark code runs quite fast, but it takes a long time to save the PySpark DataFrame to a CSV file.
I am converting the PySpark DataFrame to pandas and then saving it to a CSV file:

```python
Full_data.toPandas().to_csv("Level 1 - {} Hourly Avg Data.csv".format(yr), index=False)
```

I have also tried using the write method to save the CSV file:

```python
Full_data.repartition(1).write.format('com.databricks.spark.csv').option("header", "true").save("Level 1 - {} Hourly Avg Data.csv".format(yr))
```
Both approaches took about an hour to save the CSV file. Is there a faster way to save a CSV file from a PySpark DataFrame?
In both of the reported examples you are reducing the level of parallelism.

In the first example, toPandas() is computationally equivalent to calling collect(): the whole DataFrame is gathered into the driver, which makes the write single-threaded.

In the second example, repartition(1) reduces the level of parallelism to 1, which again makes the write single-threaded.

Try repartition(2) instead (or 4, or 8, and so on, according to the number of execution threads available on your machine). That should produce results faster by leveraging Spark's parallelism, even though it will split the output into multiple files, one per partition.
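For instance, here is a minimal sketch of that suggestion. The partition count of 4 and the output path are assumptions to adapt to your setup; note also that since Spark 2.0 the CSV writer is built in, so format('com.databricks.spark.csv') can be dropped in favor of csv():

```python
# Minimal sketch: write the CSV in parallel with Spark's built-in writer.
# The partition count (4) is an assumption; match it to the number of
# execution threads on your machine.
(Full_data
    .repartition(4)
    .write
    .option("header", "true")
    .mode("overwrite")
    .csv("Level 1 - {} Hourly Avg Data".format(yr)))
```

The path here names a directory: Spark writes one part-*.csv file per partition inside it, and you can concatenate those files afterwards if a single CSV is required.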