
Get all Apache Spark executor logs

Tags: apache-spark

I'd like to collect all the executor logs in the Spark application driver programmatically. (When something fails, I want to collect and store all the relevant logs.) Is there a nice way to do this?

One idea is to create an empty RDD with one partition per executor. Then I would somehow ensure that each partition is actually processed on a different executor (I have no idea how), run a mapPartitions that loads the executor log from disk, and finally collect to fetch the logs to the application.
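
Roughly what I have in mind, as a sketch only: it assumes a live SparkContext named sc and that executor logs live under /home/hadoop/spark/work (a placeholder; the actual path depends on the cluster manager). Since Spark gives no placement guarantee, the sketch oversubscribes partitions and deduplicates afterwards:

import java.net.InetAddress
import java.nio.file.{Files, Paths}
import scala.collection.JavaConverters._

val logDir = "/home/hadoop/spark/work"  // placeholder: depends on the cluster manager

// getExecutorMemoryStatus has one entry per executor, plus the driver.
val numExecutors = math.max(sc.getExecutorMemoryStatus.size - 1, 1)

// Oversubscribe partitions so every executor is likely to run at least one task.
val logs = sc.parallelize(1 to numExecutors * 4, numExecutors * 4)
  .mapPartitions { _ =>
    val host = InetAddress.getLocalHost.getHostName
    if (Files.exists(Paths.get(logDir)))
      Files.walk(Paths.get(logDir)).iterator().asScala
        .filter(Files.isRegularFile(_))
        .map(p => (host, p.toString, new String(Files.readAllBytes(p), "UTF-8")))
        .toList.iterator
    else Iterator.empty
  }
  .collect()

// Several partitions can land on the same executor, so deduplicate by (host, file).
val unique = logs.groupBy { case (h, f, _) => (h, f) }.map(_._2.head)

Even then there is no guarantee that every executor runs a task, and collecting whole log files to the driver only works while they fit in driver memory, which is part of why I'm hoping for a nicer way.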

Asked May 14 '15 by Daniel Darabos

1 Answer

Perhaps there is a better way, but we use a script that syncs the executor logs to S3 every 5 seconds:

#!/bin/bash
# This script syncs executor log files to S3.

while [[ $# -gt 1 ]]; do
  key="$1"
  case $key in
    -l|--log-uri)
        LOG_BUCKET="$2"
        shift
        ;;
    *)
        echo "Unknown option: ${key}"
        exit 1
        ;;
  esac
  shift
done

set -u

# Extract the EMR job flow (cluster) ID, e.g. "j-XXXXXXXXXXXX".
JOB_FLOW_ID=$(grep jobFlowId /mnt/var/lib/info/job-flow.json | sed -e 's,.*"\(j-.*\)".*,\1,g')

# Start background process that syncs every 5 seconds.
while true; do aws s3 sync /home/hadoop/spark/work "${LOG_BUCKET}/${JOB_FLOW_ID}/executors/$(hostname)/"; sleep 5; done &

We launch the script (stored on S3 in a file named sync-executor-logs.sh) with a bootstrap action:

--bootstrap-actions Path=s3://path/to/my/script/sync-executor-logs.sh,Name=Sync-executor-logs,Args=[-l,s3://path/to/logfiles]
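
For context, here is roughly how that flag might sit in a full create-cluster call; the cluster name, AMI version, instance type and count below are placeholders, not values from our setup:

aws emr create-cluster \
  --name "spark-with-log-sync" \
  --ami-version 3.8.0 \
  --instance-type m3.xlarge \
  --instance-count 3 \
  --bootstrap-actions Path=s3://path/to/my/script/sync-executor-logs.sh,Name=Sync-executor-logs,Args=[-l,s3://path/to/logfiles]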
Answered Nov 12 '22 by Glennie Helles Sindholt