 

Overwriting a Spark output using PySpark

I am trying to overwrite the output of a Spark DataFrame using the following option in PySpark, but I am not successful:

spark_df.write.format('com.databricks.spark.csv').option("header", "true",mode='overwrite').save(self.output_file_path) 

The mode='overwrite' setting is not taking effect.

asked Mar 08 '16 by Devesh


People also ask

How does overwrite work in Spark?

Overwrite. Overwrite mode means that when saving a DataFrame to a data source, if data/table already exists, existing data is expected to be overwritten by the contents of the DataFrame.
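For example, a minimal sketch of overwrite mode (the session setup and target path here are illustrative, not from the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("overwrite-demo").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# Any existing data at the target path is replaced by this write.
df.write.mode("overwrite").parquet("/tmp/demo_output")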

What is the difference between append and overwrite in PySpark?

Append - the saved DataFrame is appended to the already existing location. Overwrite - the files in the already existing location are replaced by the new content.
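To illustrate the difference, reusing the spark session and df from the sketch above (the path is still hypothetical):

# The first write creates the location.
df.write.mode("overwrite").parquet("/tmp/demo_output")

# append adds the new files alongside the existing ones,
# so the location now holds both writes.
df.write.mode("append").parquet("/tmp/demo_output")

# overwrite replaces everything previously at the location.
df.write.mode("overwrite").parquet("/tmp/demo_output")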

Can you edit the contents of an existing Spark DataFrame?

Edit: Consolidating what was said below, you can't modify the existing DataFrame as it is immutable, but you can return a new DataFrame with the desired modifications.
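A short sketch of this, reusing df from above (the new column name is arbitrary):

from pyspark.sql import functions as F

# withColumn does not change df; it returns a new DataFrame.
df_upper = df.withColumn("value_upper", F.upper(F.col("value")))
df_upper.show()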

What does .collect() do in PySpark?

collect() is the function/operation on an RDD or DataFrame that retrieves the data from it. It is useful for retrieving all the elements of every row from each partition and bringing them over to the driver node/program.
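For instance, again with df from above (note that collect() pulls the entire dataset into driver memory, so it is best kept to small results):

rows = df.collect()  # a list of Row objects on the driver
for row in rows:
    print(row["id"], row["value"])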


2 Answers

Try:

spark_df.write.format('com.databricks.spark.csv') \
    .mode('overwrite').option("header", "true").save(self.output_file_path)
answered Sep 17 '22 by user6022341


Spark 2.0 and above has a built-in csv function for the DataFrameWriter

https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter

e.g.

spark_df.write.csv(path=self.output_file_path, header="true", mode="overwrite", sep="\t") 

Which is syntactic sugar for

spark_df.write.format("csv").mode("overwrite").options(header="true",sep="\t").save(path=self.output_file_path) 

I think what is confusing is finding exactly where the options for each format are documented.

These write-related methods belong to the DataFrameWriter class: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter

The csv method has these options, which are also available when using format("csv"): https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter.csv

The way you supply parameters also depends on whether the method takes a single (key, value) pair or keyword arguments. It's fairly standard Python (*args, **kwargs) usage; it just differs from the Scala syntax.

For example, the option(key, value) method takes one option at a time, like option("header", "true"), while the options(**options) method takes any number of keyword assignments, e.g. options(header="true", sep="\t").
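A short sketch contrasting the two, reusing the asker's spark_df and self.output_file_path names:

writer = spark_df.write.format("csv").mode("overwrite")

# option() sets a single key/value pair per call and can be chained...
writer = writer.option("header", "true").option("sep", "\t")

# ...which is equivalent to passing them all at once as keyword arguments.
writer = writer.options(header="true", sep="\t")

writer.save(self.output_file_path)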

EDIT 2021

The docs have had a huge facelift, which may be good for new users discovering functionality by requirement, but it does take some adjusting to.

DataFrameReader and DataFrameWriter are now part of the Input/Output section of the API docs: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html#input-and-output

The DataFrameWriter.csv method is now here: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrameWriter.csv.html#pyspark.sql.DataFrameWriter.csv

answered Sep 18 '22 by Davos