I am using a Databricks Spark cluster (AWS) and running a Scala experiment. I ran into an issue when training on roughly 10 GB of data with the LogisticRegressionWithLBFGS algorithm. The code block where I hit the issue is as follows:
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
val algorithm = new LogisticRegressionWithLBFGS()
algorithm.run(training_set)
At first I got a lot of executor lost failures and Java out-of-memory errors. After I repartitioned my training_set into more partitions, the out-of-memory errors went away, but I still get executor lost failures.
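Roughly what the repartitioning looks like (the partition count and storage level below are just illustrative, not necessarily what I used):

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// training_set is an RDD[LabeledPoint] built earlier in the notebook
val repartitioned: RDD[LabeledPoint] = training_set
  .repartition(720)                         // more, smaller partitions
  .persist(StorageLevel.MEMORY_AND_DISK)    // allow spilling to disk under memory pressure

val algorithm = new LogisticRegressionWithLBFGS()
val model = algorithm.run(repartitioned)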
My cluster has 72 cores and 500 GB of RAM in total. Can anyone give me some ideas on this?
If an executor runs into memory issues, the task it was running fails and is rescheduled on another executor. If a task still fails after 3 retries (4 attempts in total by default, controlled by spark.task.maxFailures), its stage fails, which causes the Spark job as a whole to fail.
The usual fix here, when running on YARN, is to set --conf spark.yarn.executor.memoryOverhead=600; alternatively, when the cluster is using Mesos, try --conf spark.mesos.executor.memoryOverhead=600.
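As a sketch, the same overhead setting can also be put on a SparkConf before the context is created (the 600 MB value is just the figure from above; on Databricks this would normally go into the cluster's Spark config rather than application code):

import org.apache.spark.{SparkConf, SparkContext}

// Extra off-heap memory (in MB) requested per executor container on YARN
val conf = new SparkConf()
  .setAppName("lbfgs-training")
  .set("spark.yarn.executor.memoryOverhead", "600")

val sc = new SparkContext(conf)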
Another source of removed executors is dynamic allocation: a Spark application removes an executor when it has been idle for more than spark.dynamicAllocation.executorIdleTimeout.
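If dynamic allocation is what is removing executors, these are the settings involved (a sketch; the timeout value is illustrative):

import org.apache.spark.SparkConf

// Idle executors are released after executorIdleTimeout when dynamic
// allocation is enabled; this shows up in the logs as executors being removed.
val dynConf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")              // required by dynamic allocation
  .set("spark.dynamicAllocation.executorIdleTimeout", "120s")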
LBFGS stores the betas (feature weights) internally as a dense vector, and everything is kept in memory. So regardless of how sparse the features in the training set are, the total number of features is something to be mindful of.
To solve this, either increase executor memory or limit the total number of features in the training set.
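As a rough, back-of-the-envelope illustration of why the feature count matters (the numbers are purely hypothetical):

// Size of the dense weight vector alone; LBFGS also keeps a history of
// position/gradient vectors on the driver, and each task aggregates a dense
// gradient of the same length, so the real footprint is several times this.
val numFeatures    = 10000000L                 // hypothetical: 10 million features
val bytesPerDouble = 8L
val weightVectorMB = numFeatures * bytesPerDouble / (1024.0 * 1024.0)
println(f"dense weight vector is roughly $weightVectorMB%.0f MB")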