
Databricks notebook crashes on a memory-heavy job

I am running a few operations to aggregate a large quantity of data (about 600 GB) on Azure Databricks. I recently noticed that the notebook crashes and Databricks returns the error below. The same code worked before on a smaller 6-node cluster. After upgrading to 12 nodes, I started getting this error, and I suspect it is a configuration problem.

Any help would be appreciated. I use the default Spark configuration with the number of partitions set to 200, and I have 88 executors on my nodes.

Internal error, sorry. Attach your notebook to a different cluster or restart the current cluster.
java.lang.RuntimeException: abort: DriverClient destroyed
    at com.databricks.backend.daemon.driver.DriverClient.$anonfun$poll$3(DriverClient.scala:381)
    at scala.concurrent.Future.$anonfun$flatMap$1(Future.scala:307)
    at scala.concurrent.impl.Promise.$anonfun$transformWith$1(Promise.scala:41)
    at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
    at com.databricks.threading.NamedExecutor$$anon$2.$anonfun$run$1(NamedExecutor.scala:335)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at com.databricks.logging.UsageLogging.$anonfun$withAttributionContext$1(UsageLogging.scala:238)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
    at com.databricks.logging.UsageLogging.withAttributionContext(UsageLogging.scala:233)
    at com.databricks.logging.UsageLogging.withAttributionContext$(UsageLogging.scala:230)
    at com.databricks.threading.NamedExecutor.withAttributionContext(NamedExecutor.scala:265)
    at com.databricks.threading.NamedExecutor$$anon$2.run(NamedExecutor.scala:335)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
KLA asked Oct 27 '22 21:10


1 Answer

I'm not sure about the cost implications, but how about enabling the autoscaling option on the cluster and raising Max Workers? You can also try changing the Worker Type to one with more resources, such as more memory per node.
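If you define the cluster through a JSON spec rather than the UI, the same idea might look roughly like the sketch below. It only builds the spec as a Python dict: the `autoscale` block with `min_workers` and `max_workers` replaces a fixed `num_workers`, and `node_type_id` is where the worker type goes. The helper name, the worker counts, the node type and the runtime version are all placeholders I picked for illustration, not values from the question, so swap in whatever your workspace offers.

```python
import json


def autoscaling_cluster_spec(min_workers, max_workers, node_type_id, spark_version):
    """Build a cluster spec dict that autoscales instead of using a fixed size.

    Hypothetical helper for illustration; it only assembles the dict and does
    not call any Databricks API.
    """
    # Sanity check added for this sketch (an assumption, not a Databricks rule).
    if min_workers < 1 or max_workers < min_workers:
        raise ValueError("need 1 <= min_workers <= max_workers")
    return {
        "spark_version": spark_version,
        "node_type_id": node_type_id,
        # An autoscale range takes the place of a fixed "num_workers" value.
        "autoscale": {
            "min_workers": min_workers,
            "max_workers": max_workers,
        },
    }


spec = autoscaling_cluster_spec(
    min_workers=6,                      # placeholder: lower bound of the range
    max_workers=16,                     # placeholder: raised "Max Workers"
    node_type_id="Standard_E8s_v3",     # placeholder: a worker type with more memory
    spark_version="11.3.x-scala2.12",   # placeholder: use your runtime version
)
print(json.dumps(spec, indent=2))
```

Keep in mind that a wider autoscale range and a larger worker type both raise the possible cost, so it is worth checking the price of the worker type before applying this.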


gip answered Oct 29 '22 14:10
