I want to understand the filesystem counters in Hadoop. Below are the counters for a job that I ran.
In every job I run, I notice that the map's FILE_BYTES_READ is almost equal to its HDFS_BYTES_READ, and that FILE_BYTES_WRITTEN is roughly the sum of FILE_BYTES_READ and HDFS_BYTES_READ. Is the same data being read from both the local filesystem and HDFS, and is all of it then written to the local filesystem by the map phase?
Map
FILE_BYTES_READ 5,062,341,139
HDFS_BYTES_READ 4,405,881,342
FILE_BYTES_WRITTEN 9,309,466,964
HDFS_BYTES_WRITTEN 0
Thanks!
What you are noticing is job specific. Depending on the job, the mappers and reducers will write more or fewer bytes to the local filesystem compared to HDFS.
In your case the mappers read a similar amount of data from local disk and from HDFS, and there is no problem with that: your mapper code simply happens to need about as much data locally as it reads from HDFS. Mappers are usually asked to process more data than fits in their RAM, so it is not surprising to see the data pulled from HDFS being spilled to a local drive. Also note that the bytes read from HDFS and from local disk will not always add up to the local bytes written, and they don't quite do so in your case either: 5,062,341,139 + 4,405,881,342 = 9,468,222,481, while FILE_BYTES_WRITTEN is 9,309,466,964.
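If you want to look at these numbers programmatically rather than in the web UI, you can pull them out of the job's Counters object from the driver. A minimal sketch, assuming the Hadoop 2.x mapreduce API and its findCounter(scheme, FileSystemCounter) overload; note that this gives the job-wide totals (map and reduce combined), not the per-phase column shown above:

import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.FileSystemCounter;
import org.apache.hadoop.mapreduce.Job;

public class FsCounterDump {
    // Call this after job.waitForCompletion(true) has returned.
    static void dumpFsCounters(Job job) throws Exception {
        Counters counters = job.getCounters();
        // "FILE" is the local filesystem and "HDFS" the distributed one,
        // the same schemes you see on the job's counter page.
        for (String scheme : new String[] {"FILE", "HDFS"}) {
            Counter read = counters.findCounter(scheme, FileSystemCounter.BYTES_READ);
            Counter written = counters.findCounter(scheme, FileSystemCounter.BYTES_WRITTEN);
            System.out.println(scheme + " bytes read    = " + read.getValue());
            System.out.println(scheme + " bytes written = " + written.getValue());
        }
    }
}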
Here is an example using TeraSort with 100 GB of data (1 billion key/value pairs).
File System Counters
FILE: Number of bytes read=219712810984
FILE: Number of bytes written=312072614456
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=100000061008
HDFS: Number of bytes written=100000000000
HDFS: Number of read operations=2976
HDFS: Number of large read operations=0
Things to notice: the number of bytes read from and written to HDFS is almost exactly 100 GB, because 100 GB had to be sorted and the final sorted files had to be written back out. Also notice that the job needs a lot of local reads and writes to hold and sort the data: about 2.2x the input is read from local disk (219.7 GB) and about 3.1x is written to it (312.1 GB).
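Most of that local traffic comes from the map-side sort: map output is buffered in memory, spilled to local disk when the buffer fills up, and the spill files are later merged (read back and rewritten) before the reducers fetch them. A minimal sketch of the knobs that control this, assuming Hadoop 2.x property names (the values are arbitrary examples, not recommendations):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SortBufferConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // A bigger in-memory sort buffer means fewer spill files and merge passes,
        // i.e. less intermediate data re-read and re-written on local disk.
        conf.setInt("mapreduce.task.io.sort.mb", 512);            // MB of RAM for the map-side sort buffer
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.90f); // buffer fill level that triggers a spill
        conf.setInt("mapreduce.task.io.sort.factor", 50);         // spill files merged in one merge pass

        Job job = Job.getInstance(conf, "sort-heavy job");
        // ... set mapper/reducer classes and input/output paths as usual ...
    }
}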
As a final note: unless you just want to run a job without caring about the result, the number of HDFS bytes written should never be 0, and yours is HDFS_BYTES_WRITTEN 0.
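For reference, here is a minimal driver sketch (the class name and paths are hypothetical) of the usual setup in which HDFS_BYTES_WRITTEN ends up non-zero: the job's final output goes through FileOutputFormat to a path on HDFS.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CountersDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "counters demo");
        job.setJarByClass(CountersDemo.class);

        // Identity map/reduce is enough to exercise the filesystem counters.
        job.setMapperClass(Mapper.class);
        job.setReducerClass(Reducer.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path("/data/input"));    // hypothetical HDFS paths
        FileOutputFormat.setOutputPath(job, new Path("/data/output")); // output lands on HDFS,
                                                                       // so HDFS_BYTES_WRITTEN > 0
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}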