Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Spark java.io.IOException due to unknown reason

Tags:

apache-spark

I'm getting the following exception while running a Spark job. The job gets stuck at the same stage every time. The stage is a SQL query. I don't see any other exception in either Driver or Executor logs

java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
    at sun.nio.ch.IOUtil.read(IOUtil.java:192)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
    at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:313)
    at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
    at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:242)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
    at java.lang.Thread.run(Thread.java:748)

This exception is wrapped between these errors:

ERROR client.TransportResponseHandler: Still have 1 requests outstanding when connection from hostname.domain.com/ip is closed

The only thing I could find in the executor logs was:

INFO memory.TaskMemoryManager: Memory used in task 12302
INFO memory.TaskMemoryManager: Acquired by org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@462e08e3: 32.0 MB
INFO memory.TaskMemoryManager: Acquired by org.apache.spark.unsafe.map.BytesToBytesMap@41bed570: 2.4 GB
INFO memory.TaskMemoryManager: 0 bytes of memory were used by task 12302 but are not associated with specific consumers
INFO memory.TaskMemoryManager: 2634274570 bytes of memory are used for execution and 1826540 bytes of memory are used for storage
INFO sort.UnsafeExternalSorter: Thread 197 spilling sort data of 512.0 MB to disk (0  time so far)

But I don't believe this is an issue due to memory. The job completes successfully in a different environment with the same amount of data.

Here's my spark-submit :

spark-submit --master yarn-cluster\
--conf spark.speculation=true \
--conf spark.default.parallelism=200 \
--conf spark.executor.memory=16G \
--conf spark.memory.storageFraction=$0.3 \
--conf spark.executor.cores=5 \
--conf spark.driver.memory=2G \
--conf spark.driver.cores=4 \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.dynamicAllocation.initialExecutors=10 \
--conf spark.yarn.executor.memoryOverhead=1638 \
--conf spark.driver.maxResultSize=1G \
--conf spark.sql.autoBroadcastJoinThreshold=-1 \
--class com.test.TestClass Test.jar

I did read some articles here and there regarding a similar exception which point out towards increasing the heartbeat interval and network timeout. But I couldn't find a definitive answer.

How can I run this job successfully?

like image 873
philantrovert Avatar asked May 16 '26 08:05

philantrovert


1 Answers

This was being caused due to an issue with the data.

The driving table for all the left joins, had empty string '' as data in one of the columns which was being used to join to another table. Similariy, the other table also had a lot of empty strings for that particular column.

This was leading in a cross-join and since the number of rows were too many, the job was getting hung indefinitely.

Adding a filter to the right table, helped in solving the issue:

SELECT 
  *
FROM
  LEFT_TABLE LT
LEFT JOIN
  ( SELECT
      *
    FROM
      RIGHT_TABLE
    WHERE LENGTH(TRIM(PROBLEMATIC_COLUMN)) <> 0 ) RT
ON
  LT.PROBLEMATIC_COLUMN = RT.PROBLEMATIC_COLUMN
like image 97
philantrovert Avatar answered May 19 '26 03:05

philantrovert