
Measure throughput at datanode

Tags: hadoop, hdfs

I want to measure the throughput at each datanode by measuring the time taken for each read/write operation. It is very confusing to read through the millions of functions and find out where this happens. Could someone list the series of calls made while reading/writing a block of data? I am using version 1.0.1. Alternatively, if there is already an API that measures this at the datanode, I could use that information.

asked Nov 13 '22 by Bug Killer


1 Answer

The important classes to study to measure throughput are FSDataOutputStream for writes and FSDataInputStream for reads.

File Read: The first thing a node does when reading a file is call open() on the FileSystem object. At this point, you know that this node will begin reading shortly, and you can place code after this call returns successfully to prepare for your measurements. Calling open() on HDFS instantiates a DistributedFileSystem, which communicates with the NameNode to collect block locations (sorted by proximity to the calling node). Finally, the DistributedFileSystem object returns an FSDataInputStream ("sees" reading a file), which in turn wraps a DFSInputStream ("sees" reading blocks, handles failure). Your measurements would be scoped between the read() and close() calls on the FSDataInputStream.
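For illustration, here is a minimal sketch that times a whole-file read through this call chain and derives a throughput figure. The hdfs://namenode:9000 URI and the /data/sample.txt path are placeholders; substitute your own cluster address and file:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TimedRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder URI and path -- replace with your NameNode address and file.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);
        Path file = new Path("/data/sample.txt");

        byte[] buffer = new byte[64 * 1024];
        long bytesRead = 0;
        long start = System.nanoTime();

        // open() makes the DistributedFileSystem fetch block locations from the NameNode.
        FSDataInputStream in = fs.open(file);
        try {
            int n;
            // Each read() is served by the wrapped DFSInputStream, block by block.
            while ((n = in.read(buffer)) > 0) {
                bytesRead += n;
            }
        } finally {
            in.close();
        }

        long elapsedNs = System.nanoTime() - start;
        double mbPerSec = (bytesRead / (1024.0 * 1024.0)) / (elapsedNs / 1e9);
        System.out.printf("Read %d bytes at %.2f MB/s%n", bytesRead, mbPerSec);
    }
}
```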

File Write: The node will call create() on the FileSystem. Various checks are made at this point covering file permissions, availability, etc., but upon successful completion it returns an FSDataOutputStream object, which wraps a DFSOutputStream. The same concept applies: the former sees a continuous write, while the latter handles the coherency of the replication factor (i.e. one write becomes three writes) and failure. Similarly to a read, your measurements would be scoped between the write() and close() calls on the FSDataOutputStream.
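A matching sketch for the write path, again with a placeholder URI, path, and payload size. Note that close() is included in the timed region, since the replication pipeline only finishes flushing there:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TimedWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder URI and path -- replace with your NameNode address and file.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);
        Path file = new Path("/data/out.bin");

        byte[] buffer = new byte[64 * 1024];   // dummy payload
        long totalBytes = 128L * 1024 * 1024;  // write 128 MB for the measurement
        long written = 0;
        long start = System.nanoTime();

        // create() performs the permission/existence checks described above.
        FSDataOutputStream out = fs.create(file);
        try {
            while (written < totalBytes) {
                // The wrapped DFSOutputStream pipelines each write to the replicas.
                out.write(buffer);
                written += buffer.length;
            }
        } finally {
            out.close();  // close() waits for the replication pipeline to drain
        }

        long elapsedNs = System.nanoTime() - start;
        double mbPerSec = (written / (1024.0 * 1024.0)) / (elapsedNs / 1e9);
        System.out.printf("Wrote %d bytes at %.2f MB/s%n", written, mbPerSec);
    }
}
```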

In order to do this globally for all nodes in your cluster, you would need to override these methods in the Hadoop distribution that you deploy across your cluster; a wrapper sketch follows.
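As a rough illustration of that approach, the hypothetical wrapper below subclasses FSDataInputStream and times every read(); to deploy it cluster-wide you would patch the read path (e.g. where DistributedFileSystem.open() builds its stream) to hand out instances of it. This is a sketch of the idea, not a drop-in patch:

```java
import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;

// Hypothetical wrapper: accumulates bytes and time spent in read(),
// then reports throughput for the stream's lifetime on close().
public class TimedInputStream extends FSDataInputStream {
    private long bytes = 0;
    private long nanos = 0;

    public TimedInputStream(FSDataInputStream in) throws IOException {
        super(in);
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        long t0 = System.nanoTime();
        int n = super.read(buf, off, len);
        nanos += System.nanoTime() - t0;
        if (n > 0) bytes += n;
        return n;
    }

    @Override
    public void close() throws IOException {
        super.close();
        double secs = Math.max(nanos, 1) / 1e9;  // guard against a zero-length read
        System.out.printf("stream throughput: %.2f MB/s%n",
                (bytes / (1024.0 * 1024.0)) / secs);
    }
}
```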

answered Nov 26 '22 by Engineiro