 

Understanding the Hadoop File System Counters

I want to understand the file system counters in Hadoop.

Below are the counters for a job that I ran.

In every job that I run, I observe that the map FILE_BYTES_READ is almost equal to the HDFS_BYTES_READ, and that the FILE_BYTES_WRITTEN by the map is roughly the sum of the file bytes and HDFS bytes read by the mapper. Please help! Is the same data being read from both the local file system and HDFS, and is both being written to the local file system by the map phase?

    Map
    FILE_BYTES_READ      5,062,341,139
    HDFS_BYTES_READ      4,405,881,342
    FILE_BYTES_WRITTEN   9,309,466,964
    HDFS_BYTES_WRITTEN   0

Thanks!



1 Answer

What you are noticing is job specific: depending on the job, the mappers/reducers will write more or fewer bytes to the local file system compared to HDFS.

In your mapper's case, a similar amount of data was read from both the local file system and HDFS, and there is no problem with that. Your mapper code just happens to need to read about as much data locally as it reads from HDFS. Most of the time, mappers are used to analyze an amount of data greater than their RAM, so it is not surprising to see the data read from HDFS being written to a local drive. The number of bytes read from HDFS and locally will not always add up to the local write size (and they do not exactly in your case, either).
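If you want to dig into these numbers yourself, here is a minimal sketch (not from the original post) of how a driver can pull the same file system counters out of a finished job through the MapReduce API. The `FsCounterReport` class and `report` helper are made up for illustration; the counter group and names are the ones Hadoop prints for MRv2 (older releases label the group `FileSystemCounters`).

    import org.apache.hadoop.mapreduce.Counter;
    import org.apache.hadoop.mapreduce.Counters;
    import org.apache.hadoop.mapreduce.Job;

    public class FsCounterReport {
        // Counter group printed as "File System Counters" in MRv2.
        private static final String FS_GROUP =
                "org.apache.hadoop.mapreduce.FileSystemCounter";

        // Hypothetical helper: call after job.waitForCompletion(true).
        public static void report(Job job) throws Exception {
            Counters counters = job.getCounters();
            long fileRead    = value(counters, "FILE_BYTES_READ");
            long fileWritten = value(counters, "FILE_BYTES_WRITTEN");
            long hdfsRead    = value(counters, "HDFS_BYTES_READ");
            long hdfsWritten = value(counters, "HDFS_BYTES_WRITTEN");

            // FILE_* traffic is spill/merge/shuffle I/O on the workers' local disks;
            // HDFS_* traffic is the input splits read and the final output written.
            System.out.printf("FILE  read=%,d  written=%,d%n", fileRead, fileWritten);
            System.out.printf("HDFS  read=%,d  written=%,d%n", hdfsRead, hdfsWritten);
        }

        private static long value(Counters counters, String name) {
            Counter c = counters.findCounter(FS_GROUP, name);
            return c == null ? 0L : c.getValue();
        }
    }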

Here is an example using TeraSort, sorting 100 GB of data (1 billion key/value pairs).

    File System Counters
            FILE: Number of bytes read=219712810984
            FILE: Number of bytes written=312072614456
            FILE: Number of read operations=0
            FILE: Number of large read operations=0
            FILE: Number of write operations=0
            HDFS: Number of bytes read=100000061008
            HDFS: Number of bytes written=100000000000
            HDFS: Number of read operations=2976
            HDFS: Number of large read operations=0

Things to notice: the number of bytes read from and written to HDFS is almost exactly 100 GB in each direction. That is because 100 GB of input needed to be sorted, and the final sorted files had to be written back out. Also notice that the job has to do a lot of local reads and writes to hold and sort the data, roughly 2x and 3x the amount of data it read from HDFS!
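That multiplier on local I/O comes from the map-side spill and merge passes. As a rough, illustrative sketch (these are the standard MRv2 property names, but the values below are invented), the knobs that matter are the in-memory sort buffer size, the spill threshold, and the merge fan-in, which together determine how many times the same bytes hit local disk:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class SortTuningSketch {
        public static Job configure() throws Exception {
            Configuration conf = new Configuration();
            // In-memory sort buffer per map task, in MB. Once it fills past the
            // spill threshold, sorted runs go to local disk (FILE_BYTES_WRITTEN).
            conf.setInt("mapreduce.task.io.sort.mb", 256);
            // How many spill files are merged in one pass. A small factor means
            // extra merge passes, i.e. the same data is read/written locally again.
            conf.setInt("mapreduce.task.io.sort.factor", 64);
            // Fraction of the sort buffer that triggers a spill to local disk.
            conf.setFloat("mapreduce.map.sort.spill.percent", 0.80f);
            return Job.getInstance(conf, "sort-heavy job (illustrative)");
        }
    }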

As a final note: unless you just want to run a job without caring about the result, the amount of HDFS bytes written should never be 0, and yours is (HDFS_BYTES_WRITTEN = 0).
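If you want the driver to catch that situation, a small hypothetical guard like the one below (continuing the counter-lookup sketch above, and assuming the same `job` object and imports) would fail the run when nothing was written back to HDFS:

    // Hypothetical driver-side guard, not part of the original answer.
    boolean ok = job.waitForCompletion(true);
    long hdfsWritten = job.getCounters()
            .findCounter("org.apache.hadoop.mapreduce.FileSystemCounter",
                         "HDFS_BYTES_WRITTEN")
            .getValue();
    if (!ok || hdfsWritten == 0) {
        throw new IllegalStateException(
                "Job finished but wrote no bytes to HDFS (HDFS_BYTES_WRITTEN = 0)");
    }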
