In my Spark application there is an object ResourceFactory which contains an Akka ActorSystem for providing resource clients. So when I run this Spark application, every worker node will create an ActorSystem. The problem is that when the Spark application finishes its work and gets shut down, the ActorSystem on every worker node stays alive and prevents the whole application from terminating; it just hangs.
Is there a way to register some listener on the SparkContext so that when the sc gets shut down, the ActorSystem on every worker node gets notified to shut itself down?
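For what it's worth, the driver side does expose a listener hook through the SparkListener API; a minimal sketch is below, but note that it fires on the driver only, so by itself it does not reach the ActorSystems living in the executor JVMs.
import org.apache.spark.scheduler.{SparkListener, SparkListenerApplicationEnd}

// Driver-side only: runs when the application ends, not on the executors,
// so it can only clean up a driver-local ActorSystem.
sc.addSparkListener(new SparkListener {
  override def onApplicationEnd(end: SparkListenerApplicationEnd): Unit = {
    ResourceFactory.actorSystem.terminate()   // or actorSystem.shutdown() on Akka 2.3
  }
})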
UPDATE:
Following is the simplified skeleton:
There is a ResourceFactory, which is an object that contains an actor system. It also provides a fetchData method.
object ResourceFactory {
  val actorSystem = ActorSystem("resource-akka-system")
  def fetchData(): SomeData = ...
}
Then there is a user-defined RDD class; in its compute method, it needs to fetch data from the ResourceFactory.
class MyRDD extends RDD[SomeClass] {
  override def compute(...) = {
    ...
    ResourceFactory.fetchData()
    ...
    someIterator
  }
}
So on every node there will be one ActorSystem named "resource-akka-system", and the MyRDD instances distributed on the worker nodes can get data from that "resource-akka-system".
The problem is that when the SparkContext gets shut down there is no need for those "resource-akka-system"s any more, but I don't know how to notify the ResourceFactory to shut down the "resource-akka-system" when the SparkContext gets shut down. So right now the "resource-akka-system" stays alive on each worker node and prevents the whole program from exiting.
UPDATE2:
With some more experiments, I find that in local mode the program hangs, but in yarn-cluster mode the program exits successfully. Maybe this is because YARN kills the threads on the worker nodes when the sc is shut down?
UPDATE3:
To check whether every node contains an ActorSystem, I changed the code as follows (this is the real skeleton, since I add another class definition):
object ResourceFactory {
  println("creating resource factory")
  val actorSystem = ActorSystem("resource-akka-system")
  def fetchData(): SomeData = ...
}
class MyRDD extends RDD[SomeClass] {
  println("creating my rdd")
  override def compute(...) = {
    new RDDIterator(...)
  }
}
class RDDIterator(...) extends Iterator[SomeClass] {
  println("creating rdd iterator")
  ...
  lazy val reader = {
    ...
    ResourceFactory.fetchData()
    ...
  }
  ...
  override def next() = {
    ...
    reader.xx()
  }
}
After adding those printlns, I ran the code on Spark in yarn-cluster mode. I find that on the driver I have the following prints:
creating my rdd
creating resource factory
creating my rdd
...
While on some of the workers, I have the following prints:
creating rdd iterator
creating resource factory
And on some of the workers nothing is printed at all (none of them were assigned any tasks).
Based on the above, I think the object is initialized eagerly on the driver, since it prints creating resource factory on the driver even when nothing refers to it, while the object is initialized lazily on the workers, because it prints creating resource factory after printing creating rdd iterator, since the resource factory is lazily referenced by the first created RDDIterator.
And I find that in my use case the MyRDD class is only created on the driver.
I am not very sure about the laziness of the object initialization on the driver and the workers; it's my guess, because maybe other parts of the program make it look like that. But I think it should be right that there is one actor system on each worker node when it is needed.
I don't think that there is a way to tap into each Worker's lifecycle.
Also, I have some questions regarding your implementation:
If you have an object that contains a val which is used from a function run on a worker, my understanding is that this val gets serialized and broadcast to the worker. Can you confirm that you have one ActorSystem running per worker?
An actor system is usually terminated immediately if you don't explicitly wait for its termination. Are you calling something like system.awaitTermination or blocking on system.whenTerminated?
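For completeness, blocking on termination looks roughly like this; the exact call depends on the Akka version in use.
import scala.concurrent.Await
import scala.concurrent.duration.Duration

// Akka 2.4+: block the calling thread until the actor system has fully terminated.
Await.result(actorSystem.whenTerminated, Duration.Inf)

// Akka 2.3.x equivalent:
// actorSystem.awaitTermination()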
Anyway, there is another way you can shut down the actor systems on the remote workers: have the address of the driver's actor system (where the sc is) broadcast to each worker. In simple words, just have a val with that address.
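A rough sketch of that idea; every name below (the beacon actor, the driver-system name, host and port) is hypothetical, and it assumes Akka remoting is configured on both the driver and the workers. The driver exposes an actor whose only job is to exist, its path is broadcast as a plain val, and each worker's actor system watches it and terminates itself when the driver goes away.
import akka.actor.{Actor, ActorIdentity, ActorSystem, Identify, Props, Terminated}

// Driver side: an actor that exists only so that workers can watch its lifetime.
class ShutdownBeacon extends Actor {
  def receive = Actor.emptyBehavior
}

// Worker side: resolve the driver's beacon, watch it, and shut the local
// "resource-akka-system" down once the driver's actor system disappears.
class ShutdownWatcher(driverBeaconPath: String) extends Actor {
  context.actorSelection(driverBeaconPath) ! Identify("beacon")

  def receive = {
    case ActorIdentity("beacon", Some(ref)) => context.watch(ref)
    case Terminated(_)                      => context.system.terminate()
  }
}

// On the driver:
//   val driverSystem = ActorSystem("driver-system")   // remoting enabled in its config
//   driverSystem.actorOf(Props[ShutdownBeacon], "beacon")
//   val beaconPath = "akka.tcp://driver-system@driver-host:2552/user/beacon"   // hypothetical host/port
//   val beaconPathBc = sc.broadcast(beaconPath)        // "just have a val with that address"
//
// On each worker, e.g. the first time ResourceFactory is used:
//   ResourceFactory.actorSystem.actorOf(Props(new ShutdownWatcher(beaconPathBc.value)))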