My Spark job seems to spend a lot of time getting blocks. Sometimes it does this for an hour or two. I have one partition for my dataset, so I'm not sure why it's doing so much shuffling. Does anyone know what exactly is happening here?
15/12/16 18:05:27 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
15/12/16 18:05:27 INFO ShuffleBlockFetcherIterator: Getting 4 non-empty blocks out of 4 blocks
15/12/16 18:05:27 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
15/12/16 18:05:40 INFO ShuffleBlockFetcherIterator: Getting 200 non-empty blocks out of 200 blocks
15/12/16 18:05:40 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
15/12/16 18:05:40 INFO ShuffleBlockFetcherIterator: Getting 4 non-empty blocks out of 4 blocks
15/12/16 18:05:40 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
15/12/16 18:05:59 INFO ShuffleBlockFetcherIterator: Getting 200 non-empty blocks out of 200 blocks
15/12/16 18:05:59 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
15/12/16 18:05:59 INFO ShuffleBlockFetcherIterator: Getting 4 non-empty blocks out of 4 blocks
15/12/16 18:05:59 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
15/12/16 18:06:13 INFO ShuffleBlockFetcherIterator: Getting 200 non-empty blocks out of 200 blocks
15/12/16 18:06:13 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
15/12/16 18:06:13 INFO ShuffleBlockFetcherIterator: Getting 4 non-empty blocks out of 4 blocks
15/12/16 18:06:13 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
15/12/16 18:06:33 INFO ShuffleBlockFetcherIterator: Getting 200 non-empty blocks out of 200 blocks
15/12/16 18:06:33 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
15/12/16 18:06:33 INFO ShuffleBlockFetcherIterator: Getting 4 non-empty blocks out of 4 blocks
15/12/16 18:06:33 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
15/12/16 18:06:49 INFO ShuffleBlockFetcherIterator: Getting 200 non-empty blocks out of 200 blocks
15/12/16 18:06:49 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
15/12/16 18:06:49 INFO ShuffleBlockFetcherIterator: Getting 4 non-empty blocks out of 4 blocks
15/12/16 18:06:49 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
15/12/16 18:07:14 INFO ShuffleBlockFetcherIterator: Getting 200 non-empty blocks out of 200 blocks
15/12/16 18:07:14 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
15/12/16 18:07:14 INFO ShuffleBlockFetcherIterator: Getting 4 non-empty blocks out of 4 blocks
15/12/16 18:07:14 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
15/12/16 18:07:33 INFO ShuffleBlockFetcherIterator: Getting 200 non-empty blocks out of 200 blocks
15/12/16 18:07:33 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
15/12/16 18:07:33 INFO ShuffleBlockFetcherIterator: Getting 4 non-empty blocks out of 4 blocks
15/12/16 18:07:33 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
15/12/16 18:07:46 INFO ShuffleBlockFetcherIterator: Getting 200 non-empty blocks out of 200 blocks
15/12/16 18:07:46 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
15/12/16 18:07:47 INFO ShuffleBlockFetcherIterator: Getting 4 non-empty blocks out of 4 blocks
15/12/16 18:07:47 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
15/12/16 18:07:58 INFO ShuffleBlockFetcherIterator: Getting 200 non-empty blocks out of 200 blocks
15/12/16 18:07:58 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
15/12/16 18:07:58 INFO ShuffleBlockFetcherIterator: Getting 4 non-empty blocks out of 4 blocks
15/12/16 18:07:58 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
ShuffleBlockFetcherIterator is a Scala Iterator that fetches multiple shuffle blocks (aka shuffle map outputs) from local and remote BlockManagers.
It allows for iterating over a sequence of blocks as (BlockId, InputStream) pairs so a caller can handle shuffle blocks in a pipelined fashion as they are received.
For better performance, you need to tune either your operations or your configuration.
FYI, Spark is probably doing more than just fetching blocks; logging is probably disabled for everything else. If you haven't yet, go to the Spark history server and view the SQL tab for your query. Chances are this is a symptom of shuffling and processing too much data. Try reducing the amount of data you're moving around, breaking it into smaller pieces, or getting a bigger cluster.
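One way to check this from the log alone is to measure the gaps between fetch rounds. In the log above, every round reports "Started 0 remote fetches" in 0 or 1 ms, which suggests the blocks are read locally and the fetch itself is nearly instant, so the 11 to 25 seconds between rounds is most likely spent processing data rather than fetching it. The sketch below is plain Python with no Spark dependency; `fetch_gaps` is a hypothetical helper name, and the timestamp format is assumed from the lines posted in the question.

```python
import re
from datetime import datetime

# Matches the "Getting N non-empty blocks out of M blocks" lines from the log.
FETCH_RE = re.compile(
    r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) INFO ShuffleBlockFetcherIterator: "
    r"Getting (\d+) non-empty blocks out of (\d+) blocks"
)


def fetch_gaps(lines):
    """Return the seconds elapsed between consecutive distinct fetch timestamps."""
    stamps = []
    for line in lines:
        match = FETCH_RE.match(line.strip())
        if not match:
            continue  # skip "Started ... remote fetches" and unrelated lines
        ts = datetime.strptime(match.group(1), "%y/%m/%d %H:%M:%S")
        # Several fetches can share one timestamp; count each second only once.
        if not stamps or ts != stamps[-1]:
            stamps.append(ts)
    return [(later - earlier).total_seconds()
            for earlier, later in zip(stamps, stamps[1:])]


# A few lines copied from the log in the question.
sample = [
    "15/12/16 18:05:27 INFO ShuffleBlockFetcherIterator: Getting 4 non-empty blocks out of 4 blocks",
    "15/12/16 18:05:27 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms",
    "15/12/16 18:05:40 INFO ShuffleBlockFetcherIterator: Getting 200 non-empty blocks out of 200 blocks",
    "15/12/16 18:05:40 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms",
    "15/12/16 18:05:40 INFO ShuffleBlockFetcherIterator: Getting 4 non-empty blocks out of 4 blocks",
    "15/12/16 18:05:59 INFO ShuffleBlockFetcherIterator: Getting 200 non-empty blocks out of 200 blocks",
]

print(fetch_gaps(sample))  # [13.0, 19.0]
```

If the gaps dominate the wall-clock time, as they do here, the history server is the better place to find out which operation is running in between.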
"I have 1 partition for my dataset so I'm not sure why its doing so much shuffling."
Also keep in mind that "partition" is an overloaded word in Spark. Even if you have one table partition, you may have multiple Spark data partitions (slices) while processing the data.
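To make the distinction concrete, here is a simplified model of what a shuffle does to a single input partition. It is plain Python, not Spark code, and `shuffle_partition` is a hypothetical name; it only mimics the idea of hash partitioning (hash of the key, modulo the number of output partitions). The value 200 is the default of `spark.sql.shuffle.partitions`, which is likely where the "200 blocks" in the log comes from, although that is an assumption about this particular job.

```python
from collections import Counter

NUM_SHUFFLE_PARTITIONS = 200  # default value of spark.sql.shuffle.partitions


def shuffle_partition(key, num_partitions):
    """Assign a key to an output partition: hash of the key, modulo the count.

    Python's % always returns a non-negative result for a positive modulus,
    so negative hashes still land in range(num_partitions).
    """
    return hash(key) % num_partitions


# One "table partition" holding 10,000 rows keyed by an integer id.
rows = [(user_id, 1) for user_id in range(10_000)]

# After the shuffle, rows are spread over the output partitions by key.
counts = Counter(shuffle_partition(key, NUM_SHUFFLE_PARTITIONS) for key, _ in rows)

print(len(counts))   # 200
print(counts[0])     # 50
```

So a dataset that starts as one partition can still fan out into 200 slices as soon as an operation such as a join or aggregation triggers a shuffle.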