
How to overwrite the output directory in Spark

Tags:

apache-spark

People also ask

How do I overwrite a parquet file?

You can also use df.write.mode(mode: String).parquet(path), where mode: String can be "overwrite", "append", "ignore", or "error".
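
For example, a minimal self-contained sketch (the app name and output path are placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("OverwriteParquet").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(1, 2, 3).toDF("n")
// "overwrite" replaces existing data at the path; "append", "ignore" and "error" are the other modes.
df.write.mode("overwrite").parquet("/tmp/output/parquet")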

Which save mode will raise an error message if data already exists in target path?

ErrorIfExists mode means that when saving a DataFrame to a data source, if data already exists, an exception is expected to be thrown.
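
As a sketch, reusing the df and spark from the example above, the default mode throws when the target already holds data:

import org.apache.spark.sql.SaveMode

// SaveMode.ErrorIfExists (the default) raises an AnalysisException
// if /tmp/output/parquet already contains data.
df.write.mode(SaveMode.ErrorIfExists).parquet("/tmp/output/parquet")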

What is truncate false in Spark?

The following answer applies to a Spark Streaming application. By setting the "truncate" option to false, you tell the console output sink to display full column contents rather than truncating them.
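
A minimal sketch with the Structured Streaming console sink, reusing the spark session from the first sketch above (the rate source is just a demo stream):

val stream = spark.readStream.format("rate").load()
stream.writeStream
  .format("console")
  .option("truncate", "false") // print full column values instead of cutting them off
  .start()
  .awaitTermination()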


UPDATE: I suggest using DataFrames, plus something like ... .write.mode(SaveMode.Overwrite) ....

A handy "pimp my library" implicit class:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{SaveMode, SparkSession}

implicit class PimpedStringRDD(rdd: RDD[String]) {
  // Write the RDD as a text file at path p, overwriting any existing output.
  def write(p: String)(implicit ss: SparkSession): Unit = {
    import ss.implicits._
    rdd.toDF().as[String].write.mode(SaveMode.Overwrite).text(p)
  }
}
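
Usage might look like this (a sketch; the session, data, and path are placeholders):

implicit val ss: SparkSession =
  SparkSession.builder().appName("PimpDemo").master("local[*]").getOrCreate()

val lines: RDD[String] = ss.sparkContext.parallelize(Seq("foo", "bar"))
lines.write("/tmp/lines-out") // replaces /tmp/lines-out if it already exists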

For older versions, try

yourSparkConf.set("spark.hadoop.validateOutputSpecs", "false")
val sc = new SparkContext(yourSparkConf)

In 1.1.0 you can set conf settings using the spark-submit script with the --conf flag, e.g. --conf spark.hadoop.validateOutputSpecs=false.

WARNING (older versions): According to @piggybox there is a bug in Spark where it will only overwrite the files it needs to write its part- files; any other files in the output directory will be left unremoved.


Since df.save(path, source, mode) is deprecated (http://spark.apache.org/docs/1.5.0/api/scala/index.html#org.apache.spark.sql.DataFrame), use

df.write.format(source).mode("overwrite").save(path)

where df.write is a DataFrameWriter.

'source' can be ("com.databricks.spark.avro" | "parquet" | "json")
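
For instance, with a parquet source (a sketch; df and the path are placeholders):

// Equivalent to the deprecated df.save("/tmp/output", "parquet", SaveMode.Overwrite):
df.write.format("parquet").mode("overwrite").save("/tmp/output")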


From the pyspark.sql.DataFrame.save documentation (currently at 1.3.1), you can specify mode='overwrite' when saving a DataFrame:

myDataFrame.save(path='myPath', source='parquet', mode='overwrite')

I've verified that this will even remove left over partition files. So if you had say 10 partitions/files originally, but then overwrote the folder with a DataFrame that only had 6 partitions, the resulting folder will have the 6 partitions/files.

See the Spark SQL documentation for more information about the mode options.


The documentation for the parameter spark.files.overwrite says this: "Whether to overwrite files added through SparkContext.addFile() when the target file exists and its contents do not match those of the source." So it has no effect on the saveAsTextFile method.

You could do this before saving the file:

// Obtain a handle on the target filesystem (here an HDFS instance at
// localhost:9000) and recursively delete the output path if it exists,
// swallowing any error (e.g. when the path is absent).
val hadoopConf = new org.apache.hadoop.conf.Configuration()
val hdfs = org.apache.hadoop.fs.FileSystem.get(new java.net.URI("hdfs://localhost:9000"), hadoopConf)
try { hdfs.delete(new org.apache.hadoop.fs.Path(filepath), true) } catch { case _: Throwable => }

As explained here: http://apache-spark-user-list.1001560.n3.nabble.com/How-can-I-make-Spark-1-0-saveAsTextFile-to-overwrite-existing-file-td6696.html


df.write.mode('overwrite').parquet("/output/folder/path") works if you want to overwrite a parquet file using Python. This is in Spark 1.6.2; the API may differ in later versions.


val jobName = "WordCount"
// Overwrite the output directory in Spark: set("spark.hadoop.validateOutputSpecs", "false")
val conf = new SparkConf().setAppName(jobName).set("spark.hadoop.validateOutputSpecs", "false")
val sc = new SparkContext(conf)
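
With validation disabled, a subsequent saveAsTextFile to an already existing directory no longer fails (a sketch; the path is illustrative, and note the warning above about stale part- files):

sc.parallelize(Seq("hello", "world")).saveAsTextFile("/tmp/wordcount-out")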