How to configure logging in Hadoop / HDP components?

I have an HDP 2.4 cluster with the following services/components:

  1. HBase
  2. Kafka
  3. MapReduce2
  4. Storm
  5. Oozie
  6. Support services like Zookeeper, Ambari, Yarn, HDFS, etc.

I've been searching for this for several days now and would appreciate some help. I have the following two questions:

  1. How do I configure logging at both the application level (we're using log4j) and the daemon level for all the services mentioned above?
  2. What is the best practice to view all the application level logs for these services in one consolidated place? Does Ambari have something to offer or do we need third party packages (and which ones are good)?

Thanks so much for any assistance you may be able to provide!

asked Jan 17 '17 by Paras
1 Answer

If you are writing an application that leverages one or more HDP services, I would recommend updating the log4j.properties file for each of those services to match the logging level that you desire. The best way to do this is to use the Ambari Admin UI. To edit the log4j.properties of a service, follow the steps below:

  1. Click any one of the services on the left-hand side of the Dashboard.
  2. Once the Service Summary page has loaded, click the 'Configs' tab at the top of the screen.
  3. Click the 'Advanced' tab underneath the Version History Timeline, find the 'Advanced' set of properties and then search for the log4j.properties entry. Failing that, you can search for 'log4j' in the search bar at the top-right of the screen and Ambari will highlight the relevant settings.

See here for an image detailing an example of the log4j.properties file for the HDFS service.
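To give a concrete (purely illustrative) idea of what you would be editing, the log4j.properties block for a service such as HDFS typically contains a root logger level, an appender definition and per-package overrides along these lines. The exact property names, appenders and paths vary by service and HDP version, so treat this as a sketch rather than the real file:

    # Root logger level and the appender(s) it writes to
    log4j.rootLogger=INFO,RFA

    # Rolling file appender for the daemon log
    log4j.appender.RFA=org.apache.log4j.RollingFileAppender
    log4j.appender.RFA.File=${hadoop.log.dir}/${hadoop.log.file}
    log4j.appender.RFA.MaxFileSize=256MB
    log4j.appender.RFA.MaxBackupIndex=10
    log4j.appender.RFA.layout=org.apache.log4j.PatternLayout
    log4j.appender.RFA.layout.ConversionPattern=%d{ISO8601} %p %c: %m%n

    # Raise or lower individual packages without touching the root level
    log4j.logger.org.apache.hadoop.hdfs=DEBUG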

Bear in mind that the log files for each of those services will only capture the interaction between your application and that particular service. If you are working in Java, I would personally recommend adding a log4j instance to your application; if you don't know how to do this, my recommendation is to follow this tutorial (found on this SO question) to get set up correctly. Depending on how your application calls each service's APIs, you can interrogate the output of each call and log it to your own log file.
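As a rough sketch of what that looks like in a Java application (the class name and properties path below are invented for the example), the log4j 1.x pattern is simply a per-class Logger configured from your own properties file:

    import org.apache.log4j.Logger;
    import org.apache.log4j.PropertyConfigurator;

    public class HdpClientApp {

        // One Logger per class is the usual log4j 1.x convention
        private static final Logger LOG = Logger.getLogger(HdpClientApp.class);

        public static void main(String[] args) {
            // Load your own log4j.properties (hypothetical path) so your
            // application logs independently of the HDP services' settings
            PropertyConfigurator.configure("conf/log4j.properties");

            LOG.info("Starting HDP client application");
            try {
                // ... call HBase / Kafka / Oozie APIs here and log the results ...
                LOG.debug("Service call completed");
            } catch (Exception e) {
                LOG.error("Service call failed", e);
            }
        }
    }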


In terms of viewing log files in one centralised location, you have two options:

  1. Upgrade to HDP 2.5 to utilise Ambari Log Search.
  2. Stay on HDP 2.4 and create a solution from scratch using Flume.

I'll outline the two options below.

1. Upgrade to HDP 2.5 to utilise Ambari Log Search.

I would hazard that the "easier" method (by that I mean requiring the least effort on your part) would be to upgrade your cluster to HDP 2.5. The updated Hortonworks Data Platform brings a big overhaul to Ambari with its latest version, Ambari 2.4. This version includes Ambari Infra which allows you to view all log files, filter by log levels and perform graphing and complex functions thanks to Ambari Log Search.

If it isn't feasible for you to upgrade the entire cluster, another option is to obtain the Ambari 2.4 repository from Hortonworks' website and install it manually. A Hortonworks representative advised me that Ambari 2.4 can run on HDP 2.4 without issue so this may be a feasible alternative... Though I would recommend you check with Hortonworks yourself before attempting this!

The only downside with Ambari Log Search is that you wouldn't be able to include your application's logs in the search - Ambari Log Search covers the Hadoop services only.

2. Stay on HDP 2.4 and create a solution from scratch using Flume.

If you don't want to upgrade to Ambari 2.4, then other options look to be a little scarce. I don't know of any open source solutions personally, and some cursory Googling returns few results. Apache Chukwa and Facebook's Scribe are both supposed to address distributed log collection in Hadoop, but both are around nine years old. There's also an older Hortonworks process for log collection that leverages Flume for the same purpose, which might be worth a look. This SO thread also recommends Flume for similar situations. It might be worth putting some thought into collecting logs from each server's /var/log/ directory using Flume.

The upside of this solution is that your application's log files could be included in the Flume workflow as a Source and collected alongside the other HDP service logs (depending on where you decide to put them).
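For illustration, a Flume agent that tails one service log and ships it to a central HDFS directory could be configured roughly as follows; the agent name, log path and HDFS URL are all placeholders you would need to adapt per host and per service:

    # Hypothetical agent 'a1' with one source, one channel and one sink
    a1.sources = r1
    a1.channels = c1
    a1.sinks = k1

    # Tail a service log from /var/log/ (path is an example only)
    a1.sources.r1.type = exec
    a1.sources.r1.command = tail -F /var/log/hadoop/hdfs/hadoop-hdfs-namenode.log
    a1.sources.r1.channels = c1

    # Stamp each event with the originating hostname
    a1.sources.r1.interceptors = i1
    a1.sources.r1.interceptors.i1.type = host

    # File channel buffers events durably between source and sink
    a1.channels.c1.type = file
    a1.channels.c1.checkpointDir = /var/flume/checkpoint
    a1.channels.c1.dataDirs = /var/flume/data

    # Write everything to a central HDFS location, partitioned by host
    a1.sinks.k1.type = hdfs
    a1.sinks.k1.channel = c1
    a1.sinks.k1.hdfs.path = hdfs://namenode:8020/logs/%{host}
    a1.sinks.k1.hdfs.fileType = DataStream
    a1.sinks.k1.hdfs.rollInterval = 300

You would run one such agent per node and point additional sources at whichever service logs (or your own application logs) you want collected.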

answered by Fredulom