I am trying to overwrite a Spark DataFrame using the following option in PySpark, but I am not successful:
spark_df.write.format('com.databricks.spark.csv').option("header", "true",mode='overwrite').save(self.output_file_path)
The mode='overwrite' argument has no effect.
Overwrite mode means that when saving a DataFrame to a data source, if data/table already exists, existing data is expected to be overwritten by the contents of the DataFrame.
Append - the saved DataFrame is appended to the already existing location. Overwrite - the files in the already existing location will be replaced by the new content.
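For example (a sketch reusing the spark_df and self.output_file_path names from the question), the mode is set via DataFrameWriter.mode:

# Replace whatever already exists at the output path
spark_df.write.mode("overwrite").csv(self.output_file_path, header=True)

# Add spark_df's rows to the existing output instead
spark_df.write.mode("append").csv(self.output_file_path, header=True)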
Edit: Consolidating what was said below: you can't modify an existing DataFrame because it is immutable, but you can return a new DataFrame with the desired modifications.
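As a quick illustration (the column names here are hypothetical), transformations never change spark_df in place; they return a new DataFrame:

from pyspark.sql import functions as F

# spark_df itself is unchanged; the transformation yields a new DataFrame
new_df = spark_df.withColumn("amount_doubled", F.col("amount") * 2)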
collect() is the function/operation on an RDD or DataFrame that is used to retrieve its data. It is useful for retrieving all the elements of the rows from each partition of an RDD and bringing them over to the driver node/program.
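A minimal sketch, reusing the spark_df name from the question:

# collect() pulls every row back to the driver as a list of Row objects;
# only safe when the result is small enough to fit in driver memory
rows = spark_df.collect()
for row in rows:
    print(row)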
Try:
spark_df.write.format('com.databricks.spark.csv') \
    .mode('overwrite').option("header", "true").save(self.output_file_path)
Spark 1.4 and above has a built-in csv function for the DataFrameWriter:
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter
e.g.
spark_df.write.csv(path=self.output_file_path, header="true", mode="overwrite", sep="\t")
Which is syntactic sugar for
spark_df.write.format("csv").mode("overwrite").options(header="true",sep="\t").save(path=self.output_file_path)
I think what is confusing is finding where exactly the options are available for each format in the docs.
These write-related methods belong to the DataFrameWriter class: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter
The csv method has these options available, which are also available when using format("csv"): https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter.csv
The way you need to supply parameters also depends on whether the method takes a single (key, value) pair or keyword args. It's fairly standard to the way Python works generally, using (*args, **kwargs); it just differs from the Scala syntax.
For example, the option(key, value) method takes one option at a time, like option("header", "true"), while the options(**options) method takes a bunch of keyword assignments, e.g. options(header="true", sep="\t").
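Putting that together, these two writes are equivalent (a sketch reusing the names from the question):

# Chained option(key, value) calls, one pair at a time
spark_df.write.format("csv") \
    .mode("overwrite") \
    .option("header", "true") \
    .option("sep", "\t") \
    .save(self.output_file_path)

# A single options(**kwargs) call with keyword assignments
spark_df.write.format("csv") \
    .mode("overwrite") \
    .options(header="true", sep="\t") \
    .save(self.output_file_path)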
The docs have had a huge facelift, which may be good for new users discovering functionality starting from a requirement, but it does take some adjusting to.
DataFrameReader and DataFrameWriter are now part of the Input/Output section in the API docs: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html#input-and-output
The DataFrameWriter.csv callable is now here: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrameWriter.csv.html#pyspark.sql.DataFrameWriter.csv