Output from Dataproc Spark job in Google Cloud Logging

Is there a way to have the output from Dataproc Spark jobs sent to Google Cloud logging? As explained in the Dataproc docs the output from the job driver (the master for a Spark job) is available under Dataproc->Jobs in the console. There are two reasons I would like to have the logs in Cloud Logging as well:

  1. I'd like to see the logs from the executors. Often the master log will say "executor lost" with no further detail, and it would be very useful to have more information about what the executor is up to.
  2. Cloud Logging has nice filtering and search capabilities.

Currently the only output from Dataproc that shows up in Cloud Logging is log items from yarn-yarn-nodemanager-* and container_*.stderr. Output from my application code is shown in Dataproc->Jobs but not in Cloud Logging, and it's only the output from the Spark master, not the executors.

asked Dec 09 '15 by Thomas Oldervoll

People also ask

How do I check logs in Dataproc?

You can access Dataproc cluster logs using the Logs Explorer, the gcloud logging command, or the Logging API. To pre-select a cluster in the Logs Explorer: Click the cluster name on the Clusters page in console to open the Cluster details page. Click View Logs.
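For example, cluster logs can be pulled from the command line with `gcloud logging read`, filtering on the Dataproc cluster resource type (the cluster name below is a placeholder; adjust the filter and limit to your needs):

```
# Read recent log entries for a specific Dataproc cluster.
# <CLUSTER_NAME> is a placeholder for your cluster's name.
gcloud logging read \
    'resource.type="cloud_dataproc_cluster" AND resource.labels.cluster_name="<CLUSTER_NAME>"' \
    --limit 20
```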

Which command will allow you to view the details of a cloud Dataproc job?

You can view your job's driver output from the command line using the gcloud dataproc jobs wait command shown below (for more information, see View job output–GCLOUD COMMAND).
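A minimal invocation looks like this (the job ID, project, and region are placeholders for your own values):

```
# Stream the driver output of a running or finished Dataproc job.
gcloud dataproc jobs wait <JOB_ID> \
    --project <PROJECT_ID> \
    --region <REGION>
```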

What types of jobs can be run on Google Dataproc?

What type of jobs can I run? Dataproc provides out-of-the box and end-to-end support for many of the most popular job types, including Spark, Spark SQL, PySpark, MapReduce, Hive, and Pig jobs.

What is the use of Dataproc in GCP?

Dataproc is a managed Spark and Hadoop service that lets you take advantage of open source data tools for batch processing, querying, streaming, and machine learning. Dataproc automation helps you create clusters quickly, manage them easily, and save money by turning clusters off when you don't need them.


2 Answers

tl;dr

This is not natively supported now but will be natively supported in a future version of Cloud Dataproc. That said, there is a manual workaround in the interim.

Workaround

Cloud Dataproc clusters use fluentd to collect and forward logs to Cloud Logging. The fluentd configuration determines which logs get forwarded, which is why you see some logs and not others. Therefore, the simple workaround (until Cloud Dataproc supports job details in Cloud Logging) is to modify the fluentd configuration. The configuration file for fluentd on a cluster is at:

/etc/google-fluentd/google-fluentd.conf

The two easiest ways to gather additional details are:

  1. Add a new fluentd plugin based on your needs
  2. Add a new file to the list of existing files collected (line 56 has the files on my cluster)
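As a sketch of option 2, a new `<source>` block in the same style fluentd already uses for the files it collects might look like the following. The path, pos_file, and tag here are assumptions for illustration; adapt them to wherever your job actually writes its logs:

```
# Hypothetical example: tail Spark log files and tag them for forwarding.
# path, pos_file, and tag are assumptions; adjust them to your cluster.
<source>
  type tail
  format none
  path /var/log/spark/*.log
  pos_file /var/tmp/google-fluentd.spark.pos
  tag sparklog
  read_from_head true
</source>
```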

Once you edit the configuration, you'll need to restart the google-fluentd service:

/etc/init.d/google-fluentd restart

Finally, depending on your needs, you may or may not need to do this across all nodes on your cluster. Based on your use case, it sounds like you could probably just change your master node and be set.

answered Oct 14 '22 by James


You can use the Dataproc initialization action for Stackdriver for this:

gcloud dataproc clusters create <CLUSTER_NAME> \
    --initialization-actions gs://<GCS_BUCKET>/stackdriver.sh \
    --scopes https://www.googleapis.com/auth/monitoring.write
answered Oct 14 '22 by Anton Skovorodko