Output from Dataproc Spark job in Google Cloud Logging

Is there a way to have the output from Dataproc Spark jobs sent to Google Cloud logging? As explained in the Dataproc docs the output from the job driver (the master for a Spark job) is available under Dataproc->Jobs in the console. There are two reasons I would like to have the logs in Cloud Logging as well:

  1. I'd like to see the logs from the executors. Often the master log will say "executor lost" with no further detail, and it would be very useful to have more information about what the executor is up to.
  2. Cloud Logging has nice filtering and search capabilities.

Currently the only output from Dataproc that shows up in Cloud Logging is log items from yarn-yarn-nodemanager-* and container_*.stderr. Output from my application code is shown in Dataproc->Jobs but not in Cloud Logging, and it's only the output from the Spark master, not the executors.

asked Dec 09 '15 by Thomas Oldervoll

People also ask

How do I check logs in Dataproc?

You can access Dataproc cluster logs using the Logs Explorer, the gcloud logging command, or the Logging API. To pre-select a cluster in the Logs Explorer: Click the cluster name on the Clusters page in console to open the Cluster details page. Click View Logs.
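For example, cluster logs can be pulled from the command line with `gcloud logging read`, filtering on the Dataproc cluster resource type (the cluster name below is a placeholder; adjust the filter and limit to your needs):

```
# Read recent log entries for a specific Dataproc cluster.
# <CLUSTER_NAME> is a placeholder for your cluster's name.
gcloud logging read \
    'resource.type="cloud_dataproc_cluster" AND resource.labels.cluster_name="<CLUSTER_NAME>"' \
    --limit 20
```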

Which command will allow you to view the details of a cloud Dataproc job?

You can view your job's driver output from the command line using the gcloud dataproc jobs wait command shown below (for more information, see View job output–GCLOUD COMMAND).
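A minimal invocation looks like this (the job ID, project, and region are placeholders for your own values):

```
# Stream the driver output of a running or finished Dataproc job.
gcloud dataproc jobs wait <JOB_ID> \
    --project <PROJECT_ID> \
    --region <REGION>
```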

What types of jobs can be run on Google Dataproc?

What type of jobs can I run? Dataproc provides out-of-the box and end-to-end support for many of the most popular job types, including Spark, Spark SQL, PySpark, MapReduce, Hive, and Pig jobs.

What is the use of Dataproc in GCP?

Dataproc is a managed Spark and Hadoop service that lets you take advantage of open source data tools for batch processing, querying, streaming, and machine learning. Dataproc automation helps you create clusters quickly, manage them easily, and save money by turning clusters off when you don't need them.


2 Answers

tl;dr

This is not natively supported now but will be natively supported in a future version of Cloud Dataproc. That said, there is a manual workaround in the interim.

Workaround

Cloud Dataproc clusters use fluentd to collect and forward logs to Cloud Logging. The fluentd configuration determines which logs get forwarded, which is why you see some logs and not others. Therefore, the simple workaround (until Cloud Dataproc supports job details in Cloud Logging) is to modify the fluentd configuration. The configuration file for fluentd on a cluster is at:

/etc/google-fluentd/google-fluentd.conf

The two easiest ways to gather additional details are:

  1. Add a new fluentd plugin based on your needs
  2. Add a new file to the list of existing files collected (line 56 has the files on my cluster)
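As a sketch of option 2, a new `<source>` block in the same style fluentd already uses for the files it collects might look like the following. The path, pos_file, and tag here are assumptions for illustration; adapt them to wherever your job actually writes its logs:

```
# Hypothetical example: tail Spark log files and tag them for forwarding.
# path, pos_file, and tag are assumptions; adjust them to your cluster.
<source>
  type tail
  format none
  path /var/log/spark/*.log
  pos_file /var/tmp/google-fluentd.spark.pos
  tag sparklog
  read_from_head true
</source>
```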

Once you edit the configuration, you'll need to restart the google-fluentd service:

/etc/init.d/google-fluentd restart

Finally, depending on your needs, you may or may not need to do this across all nodes on your cluster. Based on your use case, it sounds like you could probably just change your master node and be set.

answered Oct 14 '22 by James


You can use the Dataproc initialization action for Stackdriver for this:

gcloud dataproc clusters create <CLUSTER_NAME> \
    --initialization-actions gs://<GCS_BUCKET>/stackdriver.sh \
    --scopes https://www.googleapis.com/auth/monitoring.write
answered Oct 14 '22 by Anton Skovorodko