 

What is the most elegant and robust way to adjust Spark log levels on Dataproc?

As explained in previous answers, the ideal way to change the verbosity of a Spark cluster is to edit the corresponding log4j.properties. However, on Dataproc Spark runs on YARN, so we have to adjust the global configuration and not just /usr/lib/spark/conf.
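For reference, the line in question in a stock Spark log4j.properties looks like this (INFO, console is the usual default; on Dataproc the file lives at /etc/spark/conf/log4j.properties, as used below):

# excerpt from /etc/spark/conf/log4j.properties; INFO is the typical default
log4j.rootCategory=INFO, console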

Several suggestions:

On Dataproc there are several gcloud commands and properties we can pass during cluster creation (see the documentation). Is it possible to change the log4j.properties under /etc/hadoop/conf by specifying

--properties 'log4j:hadoop.root.logger=WARN,console'

Maybe not; from the docs:

The --properties command cannot modify configuration files not shown above.

Another way would be to use a shell script during cluster init and run sed:

# change the log level on each node to WARN
sudo sed -i 's/log4j.rootCategory=INFO, console/log4j.rootCategory=WARN, console/g' \
    /etc/spark/conf/log4j.properties
sudo sed -i 's/hadoop.root.logger=INFO,console/hadoop.root.logger=WARN,console/g' \
    /etc/hadoop/conf/log4j.properties
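To wire this up as an initialization action, something like the following sketch should work (the bucket, script, and cluster names are placeholders):

# upload the script and reference it at cluster creation (names are placeholders)
gsutil cp set-log-levels.sh gs://my-bucket/init/set-log-levels.sh
gcloud dataproc clusters create my-cluster \
    --initialization-actions gs://my-bucket/init/set-log-levels.sh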

But is that enough, or do we need to change the hadoop.root.logger setting elsewhere (e.g. via the HADOOP_ROOT_LOGGER environment variable) as well?
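(For reference, hadoop.root.logger is typically seeded from the HADOOP_ROOT_LOGGER environment variable in Hadoop's launch scripts, so a hedged sketch of that route, assuming the stock config location, would be:)

# sketch: append to /etc/hadoop/conf/hadoop-env.sh on each node
export HADOOP_ROOT_LOGGER=WARN,console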

Frank asked Mar 23 '16


People also ask

How do I check logs in Dataproc?

Access job logs in Logging. You can access Dataproc job logs using the Logs Explorer, the gcloud logging command, or the Logging API. Dataproc job driver and YARN container logs are listed under the Cloud Dataproc Job resource.
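A minimal example with the gcloud logging command (cloud_dataproc_job is the resource type under which Dataproc job logs appear in Cloud Logging):

# read recent Dataproc job log entries
gcloud logging read 'resource.type="cloud_dataproc_job"' --limit=10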

What is the difference between Dataproc and Dataflow?

Dataproc should be used if the processing has dependencies on tools in the Hadoop ecosystem. Dataflow/Beam provides a clear separation between processing logic and the underlying execution engine.

Which Google Cloud offering is best suited for Apache Spark and Presto solutions?

Dataproc is a fully managed and highly scalable service for running Apache Hadoop, Apache Spark, Apache Flink, Presto, and 30+ open source tools and frameworks.


2 Answers

At the moment, you're right that --properties doesn't support extra log4j settings, but it's certainly something we've talked about adding. One consideration is how to balance fine-grained control over the logging configs of Spark vs. YARN vs. other long-running daemons (hiveserver2, HDFS daemons, etc.) against keeping a minimal/simple setting that is plumbed through to everything in a shared way.

At least for Spark driver logs, you can use the --driver-log-levels setting at job-submission time, which should take precedence over any of the /etc/*/conf settings. Otherwise, as you describe, init actions are a reasonable way to edit the files on cluster startup for now, keeping in mind that they may change over time and across releases.
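As a sketch, a job submission using --driver-log-levels might look like this (the cluster name is a placeholder, and the SparkPi jar path assumes a recent Dataproc image):

# submit a Spark job with WARN-level driver logging (placeholders: cluster, jar path)
gcloud dataproc jobs submit spark \
    --cluster my-cluster \
    --class org.apache.spark.examples.SparkPi \
    --jars file:///usr/lib/spark/examples/jars/spark-examples.jar \
    --driver-log-levels root=WARN,org.apache.spark=INFO \
    -- 1000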

Dennis Huo answered Sep 28 '22


Recently, support for log4j properties has been added via the --properties flag. For example, you can now use --properties 'hadoop-log4j:hadoop.root.logger=WARN,console'. See this page (https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/cluster-properties) for more details.
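A full cluster-creation sketch (the cluster name is a placeholder; since the property value itself contains a comma, gcloud's alternative-delimiter syntax ^#^ keeps it from being split as a list):

# set Hadoop's root logger to WARN at cluster creation; '^#^' switches the
# --properties list delimiter from ',' to '#' so the comma in the value survives
gcloud dataproc clusters create my-cluster \
    --properties='^#^hadoop-log4j:hadoop.root.logger=WARN,console'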

Himanshu Kohli answered Sep 28 '22