 

Overwriting a Spark output using PySpark

I am trying to overwrite the output of a Spark DataFrame using the following option in PySpark, but I am not successful:

spark_df.write.format('com.databricks.spark.csv').option("header", "true",mode='overwrite').save(self.output_file_path) 

The mode='overwrite' setting is not taking effect.

asked Mar 08 '16 by Devesh


People also ask

How does overwrite work in Spark?

Overwrite. Overwrite mode means that when saving a DataFrame to a data source, if data/table already exists, existing data is expected to be overwritten by the contents of the DataFrame.
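For example, a minimal sketch of overwrite mode (the session setup and target path here are illustrative, not from the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("overwrite-demo").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# Any existing data at the target path is replaced by this write.
df.write.mode("overwrite").parquet("/tmp/demo_output")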

What is the difference between append and overwrite in PySpark?

Append - the saved DataFrame is appended to the already existing location. Overwrite - the files in the already existing location are replaced by the new content.
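To illustrate the difference, reusing the spark session and df from the sketch above (the path is still hypothetical):

# The first write creates the location.
df.write.mode("overwrite").parquet("/tmp/demo_output")

# append adds the new files alongside the existing ones,
# so the location now holds both writes.
df.write.mode("append").parquet("/tmp/demo_output")

# overwrite replaces everything previously at the location.
df.write.mode("overwrite").parquet("/tmp/demo_output")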

Can you edit the contents of an existing Spark DataFrame?

Edit: Consolidating what was said below, you can't modify the existing DataFrame as it is immutable, but you can return a new DataFrame with the desired modifications.
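A short sketch of this, reusing df from above (the new column name is arbitrary):

from pyspark.sql import functions as F

# withColumn does not change df; it returns a new DataFrame.
df_upper = df.withColumn("value_upper", F.upper(F.col("value")))
df_upper.show()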

What does .collect() do in PySpark?

collect() is the function/operation on an RDD or DataFrame that retrieves the data from it. It is useful for retrieving all the elements of every row from each partition and bringing them over to the driver node/program.
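For instance, again with df from above (note that collect() pulls the entire dataset into driver memory, so it is best kept to small results):

rows = df.collect()  # a list of Row objects on the driver
for row in rows:
    print(row["id"], row["value"])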


2 Answers

Try:

spark_df.write.format('com.databricks.spark.csv') \
    .mode('overwrite').option("header", "true").save(self.output_file_path)
answered Sep 17 '22 by user6022341


Spark 2.0 and above has a built-in csv function for the DataFrameWriter

https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter

e.g.

spark_df.write.csv(path=self.output_file_path, header="true", mode="overwrite", sep="\t") 

Which is syntactic sugar for

spark_df.write.format("csv").mode("overwrite").options(header="true",sep="\t").save(path=self.output_file_path) 

I think what is confusing is finding exactly where the options for each format are documented.

These write-related methods belong to the DataFrameWriter class: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter

The csv method has these options, which are also available when using format("csv"): https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter.csv

The way you supply parameters also depends on whether the method takes a single (key, value) pair or keyword arguments. It's fairly standard Python (*args, **kwargs) usage; it just differs from the Scala syntax.

For example, the option(key, value) method takes one option at a time, like option("header", "true"), while the options(**options) method takes any number of keyword assignments, e.g. options(header="true", sep="\t").
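A short sketch contrasting the two, reusing the asker's spark_df and self.output_file_path names:

writer = spark_df.write.format("csv").mode("overwrite")

# option() sets a single key/value pair per call and can be chained...
writer = writer.option("header", "true").option("sep", "\t")

# ...which is equivalent to passing them all at once as keyword arguments.
writer = writer.options(header="true", sep="\t")

writer.save(self.output_file_path)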

EDIT 2021

The docs have had a huge facelift, which may be good for new users discovering functionality by requirement, but it does take some adjusting to.

DataFrameReader and DataFrameWriter are now part of the Input/Output section of the API docs: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html#input-and-output

The DataFrameWriter.csv method is now here: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrameWriter.csv.html#pyspark.sql.DataFrameWriter.csv

answered Sep 18 '22 by Davos