 

How to do performance profiling of Hadoop cluster

Does anyone know how to do performance profiling of all java code running in a Hadoop cluster?

I will explain with a simple example. During local Java development, we can run YourKit to measure the percentage of CPU time taken by each method of each class. We can see that class A calls method X, that this takes 90% of the whole app's execution time, and then fix the inefficiency in the code.

But when we run a MapReduce job on a cluster, I would also like to see what is sluggish: our map/reduce code, or the framework itself. So I would like a service that collects information about each class/method call and the percentage of execution time it consumes, gathers this somewhere (e.g., into HDFS), and then lets us analyze the method call tree together with its CPU consumption.

Question: does anyone know if such a solution exists?

P.S. Note: I understand that such a thing will slow down the cluster, and that it should be done either on a test cluster or in agreement with the customer. The question remains: "does such a thing exist?". Thanks.

Asked Jun 26 '15 by Ihor B.

2 Answers

I solved the problem. Here http://ihorbobak.com/index.php/2015/08/05/cluster-profiling/ you can find detailed instructions on how to do this.

A short summary of how the profiling is done:

  • On every host of the cluster we put a special JAR file (a modified StatsD JVM Profiler) containing a javaagent that gets embedded into every JVM process running on that machine.
  • A "javaagent" is a piece of code used to instrument programs running on the JVM. The profiler's javaagent gathers stacktraces from JVM processes 100 times per second and sends them to a dedicated host running a NoSQL database called InfluxDB (https://influxdb.com).
  • After we run the distributed app and the stacktraces have been gathered, we run a set of scripts against this database to extract class/method execution data and visualize it as a Flame Graph.
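
To make the sampling step concrete, here is a toy pure-Python analogue of what the javaagent does on the JVM: a background thread that periodically snapshots the stacks of all live threads and counts each unique stack. This is only a sketch; the real agent runs inside the JVM and ships each sample to InfluxDB rather than keeping counts in memory.

```python
import collections
import sys
import threading
import time

class StackSampler:
    """Periodically sample the stacks of all live threads and count
    each unique stack (frames joined root-first with ';')."""

    def __init__(self, interval=0.01):        # 100 samples/sec, as in the post
        self.interval = interval
        self.counts = collections.Counter()   # folded stack -> sample count
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        while not self._stop.is_set():
            for tid, frame in sys._current_frames().items():
                if tid == self._thread.ident:
                    continue                  # don't profile the sampler itself
                names = []
                while frame is not None:      # walk from leaf frame to root
                    names.append(frame.f_code.co_name)
                    frame = frame.f_back
                self.counts[";".join(reversed(names))] += 1
            time.sleep(self.interval)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

def busy(seconds=0.3):
    """CPU-bound work to profile."""
    end = time.time() + seconds
    while time.time() < end:
        sum(range(1000))

sampler = StackSampler()
sampler.start()
busy()
sampler.stop()
# Stacks containing "busy" should dominate sampler.counts.
```

The resulting stack-signature counts are exactly the kind of data the scripts later turn into a Flame Graph.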

Flame Graphs were invented by Brendan Gregg: http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html. There is a very good video by Brendan that explains how they work: https://www.youtube.com/watch?v=nZfNehCzGdw. There is also a very good book by the same author, "Systems Performance: Enterprise and the Cloud", which I highly recommend.
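
For illustration, the "folded" input format that Brendan Gregg's flamegraph.pl consumes can be produced from raw stack samples with a few lines of Python (a sketch; in the real pipeline the samples are read back from InfluxDB):

```python
import collections

def fold_stacks(samples):
    """Collapse raw stack samples into FlameGraph's folded format:
    one line per unique stack, frames root-first joined by ';',
    followed by the number of times that stack was sampled."""
    counts = collections.Counter(";".join(stack) for stack in samples)
    return [f"{stack} {n}" for stack, n in sorted(counts.items())]

samples = [
    ["main", "mapTask", "parseRecord"],
    ["main", "mapTask", "parseRecord"],
    ["main", "reduceTask", "merge"],
]
for line in fold_stacks(samples):
    print(line)
# main;mapTask;parseRecord 2
# main;reduceTask;merge 1
```

Piping such lines through flamegraph.pl renders the interactive SVG where wide frames correspond to methods that consumed the most CPU samples.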

Answered Oct 05 '22 by Ihor B.


Sorry for bumping this old thread, but I feel this might be useful for other people as well.

We actually had a similar problem. One of our production jobs was producing sub-optimal throughput with no indication why. Since we wanted to limit the dependencies on the cluster nodes and profile different frameworks such as Spark, Hadoop, and even non-JVM-based applications, we decided to build our own distributed profiler based on perf, and like Ihor, we are using Flame Graphs for visualization.

The software is currently in an alpha state (https://github.com/cerndb/Hadoop-Profiler) and only supports on-CPU profiling, but it has already shown its potential when analyzing this job.

It basically works like this in a Hadoop context:

  1. The user provides a Hadoop application ID.
  2. HProfiler performs an API request against the YARN cluster to retrieve all nodes. Alternatively, one can specify particular host addresses.
  3. Next, the profiler initiates an SSH session with every node to check whether a mapper is running on that host.
  4. Using this information, the profiler opens new SSH sessions to the nodes that are actually running the job in order to profile them. After profiling, a Java symbol mapping is constructed (using perf-map-agent) to resolve [unknown] frames to Java methods.
  5. Finally, all results are copied to the entry point and aggregated to provide a cluster average. Optionally, the user can run "atypical node detection", which identifies nodes that behave differently from the other nodes.
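
As a toy illustration of the "atypical node detection" idea (the actual heuristic in HProfiler may well differ), one can normalize each node's per-method sample counts and flag nodes whose distribution is far from the cluster average, here using total variation distance; the node names and threshold below are made up for the example:

```python
def normalize(profile):
    """Turn raw per-method sample counts into a probability distribution."""
    total = sum(profile.values())
    return {method: count / total for method, count in profile.items()}

def atypical_nodes(profiles, threshold=0.5):
    """profiles: {node: {method: sample_count}}.
    Flags nodes whose normalized profile differs from the cluster-average
    profile by more than `threshold` in total variation distance."""
    normed = {node: normalize(p) for node, p in profiles.items()}
    methods = {m for p in normed.values() for m in p}
    avg = {m: sum(p.get(m, 0.0) for p in normed.values()) / len(normed)
           for m in methods}
    flagged = []
    for node, p in normed.items():
        distance = 0.5 * sum(abs(p.get(m, 0.0) - avg[m]) for m in methods)
        if distance > threshold:
            flagged.append(node)
    return flagged

profiles = {
    "node1": {"mapTask": 90, "gc": 10},
    "node2": {"mapTask": 88, "gc": 12},
    "node3": {"diskWait": 95, "mapTask": 5},   # spends its time elsewhere
}
print(atypical_nodes(profiles))  # → ['node3']
```

Here node3 is flagged because it spends almost all of its samples in a method the other nodes barely touch, which is exactly the "doing things differently" signal described above.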

If you're interested, we wrote a more detailed write-up about this:

https://db-blog.web.cern.ch/blog/joeri-hermans/2016-04-hadoop-performance-troubleshooting-stack-tracing-introduction

I hope this helps!

Answered Oct 05 '22 by Joeri Hermans