 

Pyspark: TaskMemoryManager: Failed to allocate a page: Need help in Error Analysis

I am facing these errors while running a Spark job in standalone cluster mode.

My Spark job:

  • runs some groupBy,
  • count,
  • and join operations to get a final DataFrame, and then calls df.toPandas().to_csv().
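
A minimal sketch of a job with this shape, for reference (column names, file paths, and the session setup are hypothetical placeholders, not the actual job):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example-job").getOrCreate()

# ~524 MB input dataset
df = spark.read.csv("input.csv", header=True, inferSchema=True)

# groupby + count, then a join to produce the final DataFrame
counts = df.groupBy("key_col").count()
final_df = df.join(counts, on="key_col", how="left")

# toPandas() pulls the entire result into driver memory, then writes one local CSV
final_df.toPandas().to_csv("output.csv", index=False)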

The input dataset is 524 MB. The error I get:

WARN TaskMemoryManager: Failed to allocate a page (33554432 bytes), try again.

After the above warning repeats several times, new errors appear:

  1. WARN NettyRpcEnv: Ignored failure: java.util.concurrent.TimeoutException: Cannot receive any reply in 10 seconds

  2. org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. This timeout is controlled by spark.executor.heartbeatInterval

  3. at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException

  4. ExecutorLostFailure (executor driver exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 158295 ms

  5. Exception happened during processing of request from ('127.0.0.1', 49128)

     Traceback (most recent call last):
       File "/home/stp/spark-2.0.0-bin-hadoop2.7/python/pyspark/accumulators.py", line 235, in handle
         num_updates = read_int(self.rfile)
       File "/home/stp/spark-2.0.0-bin-hadoop2.7/python/pyspark/serializers.py", line 545, in read_int
         raise EOFError
     EOFError

  6. And finally:

    py4j.protocol.Py4JNetworkError: An error occurred while trying to connect to the Java server (127.0.0.1:38073)

On first thought, I assumed the error might be a memory issue (TaskMemoryManager), but out of 16 GB total the process was consuming at most 6 GB, leaving 9+ GB free. I had also set the driver memory to 10 GB, so I ruled memory out.

However, when I do a count() or show() on my final DataFrame, it succeeds; it is the toPandas().to_csv() step that throws the errors/warnings above.
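
For reference, the difference between the successful calls and the failing one (the DataFrame name and output path are placeholders):

# These bring only a small result (a single number / a handful of rows) back to the driver:
final_df.count()
final_df.show(20)

# This collects the entire DataFrame into driver memory as a pandas object
# before writing a single local CSV:
final_df.toPandas().to_csv("output.csv", index=False)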

I don't actually understand what might be causing the issue.

Please help me analyze the above errors. Any help/comment is welcome. Thanks.

Satya asked Oct 10 '16


1 Answer

In our case, we had a lot of smaller tables (< 10 MB), so we decided to disable broadcast joins and, in addition, started using G1GC for garbage collection. Add these entries to your spark-defaults.conf file in $SPARK_HOME/conf:

spark.driver.extraJavaOptions -XX:+UseG1GC
spark.executor.extraJavaOptions  -XX:+UseG1GC
spark.sql.autoBroadcastJoinThreshold    -1

Alternatively, you can adjust the autoBroadcast threshold size and see if that solves the issue.
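
The broadcast threshold (unlike the GC flags, which must be set before the JVM starts) is a runtime SQL setting, so if you prefer not to edit spark-defaults.conf, a sketch like the following should also work (session and variable names are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("broadcast-tuning").getOrCreate()

# Disable automatic broadcast joins entirely ...
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

# ... or adjust the size cutoff (in bytes) below which tables are broadcast,
# e.g. 20 MB instead of the default 10 MB.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 20 * 1024 * 1024)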

user2608613 answered Oct 13 '22