 

Writing to a file in HDFS in Hadoop

I was looking for a disk-intensive Hadoop application to test I/O activity in Hadoop, but I couldn't find any application that kept disk utilization above, say, 50%, or that actually keeps the disk busy. I tried randomwriter, but surprisingly it is not disk-I/O intensive.

So, I wrote a tiny program to create a file in the Mapper and write some text into it. This application works well, but the utilization is high only on the master node, which is also the name node, the job tracker, and one of the slaves. The disk utilization is nil or negligible on the other task trackers. I'm unable to understand why disk I/O is so low on the task trackers. Could anyone please nudge me in the right direction if I'm doing something wrong? Thanks in advance.

Here is the sample code segment I wrote in the WordCount.java file to create a file and write a UTF string into it:

// inside the map() of the WordCount Mapper; itr, word, one and context
// come from the standard WordCount example
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path outFile;
while (itr.hasMoreTokens()) {
    word.set(itr.nextToken());
    context.write(word, one);
    // create a per-task-attempt file in HDFS and write a short string to it
    outFile = new Path("./dummy" + context.getTaskAttemptID());
    FSDataOutputStream out = fs.create(outFile);

    out.writeUTF("helloworld");
    out.close();
    fs.delete(outFile);
}
asked Nov 19 '12 by Gudda Bhoota


1 Answer

I think that any mechanism which creates Java objects per cell in each row, and then serializes those objects before saving them to disk, has little chance of saturating the disk. In my experience such serialization runs at a few MB per second, maybe a bit more, but not at 100 MB per second.
So bypassing the Hadoop layers on the output path, as you did, is quite right. Now let's consider how a write to HDFS works. The data is written to the local disk via the local datanode, and then synchronously to other nodes over the network, depending on your replication factor. So you cannot write data into HDFS faster than your network bandwidth allows. If your cluster is relatively small, things get worse: with a 3-node cluster and triple replication every byte is pushed to all nodes, so the whole cluster's HDFS write bandwidth is about 1 Gbit - if you have such a network.
So, I would suggest that you:
a) Reduce the replication factor to 1, so you stop being bound by the network.
b) Write bigger chunks of data in one call from the mapper (see the sketch below).
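
Here is a minimal sketch of both suggestions applied to the snippet from the question (the dfs.replication setting, the 64 MB chunk size and the per-file replication argument are illustrative assumptions, not part of the original answer):

// inside the map() method, as in the question's snippet
Configuration conf = new Configuration();
conf.set("dfs.replication", "1");            // a) replicate to one node only
FileSystem fs = FileSystem.get(conf);

// b) write one large chunk instead of many tiny writeUTF() calls
byte[] chunk = new byte[64 * 1024 * 1024];   // 64 MB of dummy data (assumed size)
java.util.Arrays.fill(chunk, (byte) 'x');

Path outFile = new Path("./dummy" + context.getTaskAttemptID());
FSDataOutputStream out = fs.create(outFile, (short) 1);  // per-file replication = 1
out.write(chunk);
out.close();
fs.delete(outFile, false);

With replication 1 each mapper writes only to its local datanode, so the disks on the task trackers should be exercised instead of the network.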

answered Oct 04 '22 by David Gruzman