Spark: incorrect behaviour when throwing SparkException in EMR

I'm running a Spark job on EMR with YARN as the resource manager, on 2 nodes. I need to purposely fail the step if my condition is not met, so that the next step doesn't execute (as per the step configuration). To achieve this I throw a custom exception after inserting a log message in DynamoDB.

It runs fine, but the record in DynamoDB is inserted twice.

Below is my code.

if(<condition>) {
  <method call to insert in dynamo> 
  throw new SparkException(<msg>);
  return;
}

If I remove the line that throws the exception, it works fine, but then the step completes successfully.

How can I make the step fail without the log message being inserted twice?

Thanks for the help.

Regards, Sorabh

Asked Sep 26 '17 by Sorabh Kumar



1 Answer

The DynamoDB message was probably inserted twice because your error condition was hit and processed by two different executors. Spark divides the work among its executors, and those executors don't share any state.
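
To make that concrete, here is a minimal sketch of the failure mode, assuming the condition check and the DynamoDB write happen inside executor-side code (isBad and insertDynamoLog are hypothetical stand-ins, not your actual methods). The closure runs once per task, and a failed task can be retried on another executor, so the side effect can fire more than once:

import org.apache.spark.SparkException
import org.apache.spark.sql.SparkSession

object DuplicateInsertSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("duplicate-side-effect-sketch").getOrCreate()

    def isBad(s: String): Boolean = s.isEmpty                    // hypothetical stand-in for your condition
    def insertDynamoLog(s: String): Unit = println(s"log: $s")   // hypothetical stand-in for the DynamoDB insert

    // This closure is shipped to the executors. If the task that hits the bad
    // record fails, Spark can re-run it (retries, speculative execution), so
    // insertDynamoLog may execute more than once, on different executors.
    spark.sparkContext.parallelize(Seq("a", "", "c")).foreach { record =>
      if (isBad(record)) {
        insertDynamoLog(record)
        throw new SparkException("bad record encountered")
      }
    }
    spark.stop()
  }
}

On a real cluster, how many inserts you end up with depends on how many times the failing task is attempted.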

I'm not sure what is driving your requirement to have the Spark step fail, but I would suggest tracking that failure case in your application code rather than trying to have Spark die directly. In other words, write code that detects the error and passes it back to your Spark driver, then act on it as appropriate.

One way to do this would be to use an accumulator to count any errors that occur as you are processing your data. It would look roughly like this (I'm assuming Scala and DataFrames, but you can adapt it to RDDs and/or Python as needed):

import org.apache.spark.sql.functions.udf
import spark.implicits._  // for the $"column" syntax

// Count error rows on the executors; read the total back on the driver.
val accum = sc.longAccumulator("Error Counter")

def doProcessing(a: String, b: String): String = {
  if (condition) {       // your error condition
    accum.add(1)
    null
  } else {
    doComputation(a, b)
  }
}
val doProcessingUdf = udf(doProcessing _)

val resultDf = df.withColumn("result", doProcessingUdf($"a", $"b"))

resultDf.write.format(..).save(..)  // Accumulator value is not populated until an action runs!

if (accum.value > 0) {
  // An error was detected during computation! Do whatever needs to be done.
  <insert dynamo message here>
}

One nice thing about this approach is that, if you want feedback while the job is running, you can watch the accumulator values in the Spark UI. For reference, here is the documentation on accumulators: http://spark.apache.org/docs/latest/rdd-programming-guide.html#accumulators
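
If the EMR step itself still needs to fail so that the next step doesn't run, one option (a sketch building on the snippet above; insertDynamoMessage is a hypothetical stand-in for the DynamoDB call) is to log once and then throw from the driver after the action has finished. The write then happens exactly once, on the driver, and the uncaught exception fails the application, which marks the step as failed:

import org.apache.spark.SparkException

// accum is the longAccumulator from the snippet above;
// insertDynamoMessage is a hypothetical stand-in for the DynamoDB logging call.
if (accum.value > 0) {
  insertDynamoMessage(s"${accum.value} rows failed the check")  // runs once, on the driver
  // An uncaught exception in the driver fails the application, so the EMR step
  // is reported as failed and later steps are skipped or the cluster is
  // terminated, depending on the step's ActionOnFailure setting.
  throw new SparkException(s"Aborting: ${accum.value} validation errors")
}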

Answered Nov 11 '22 by Ryan Widmaier