 

How to append to an hdfs file on an extremely small cluster (3 nodes or less)

Tags: java, hadoop, hdfs

I am trying to append to a file on HDFS on a single-node cluster. I also tried on a two-node cluster but got the same exceptions.

In hdfs-site.xml, I have dfs.replication set to 1. If I set dfs.client.block.write.replace-datanode-on-failure.policy to DEFAULT, I get the following exception:

java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[10.10.37.16:50010], original=[10.10.37.16:50010]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.

If I follow the recommendation in the comment for the configuration in hdfs-default.xml for extremely small clusters (3 nodes or less) and set dfs.client.block.write.replace-datanode-on-failure.policy to NEVER I get the following exception:

org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.SafeModeException): Cannot append to file/user/hadoop/test. Name node is in safe mode.
The reported blocks 1277 has reached the threshold 1.0000 of total blocks 1277. The number of live datanodes 1 has reached the minimum number 0. In safe mode extension. Safe mode will be turned off automatically in 3 seconds.
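
For completeness, the same client-side settings can also be applied programmatically on the Configuration used by the client (a sketch; the property names are the ones documented in hdfs-default.xml):

import org.apache.hadoop.conf.Configuration;

// Sketch: apply the append pipeline-recovery settings on the client
// Configuration instead of (or in addition to) hdfs-site.xml.
Configuration conf = new Configuration();
// Never try to replace a failed datanode (the recommendation for clusters of 3 nodes or less).
conf.set("dfs.client.block.write.replace-datanode-on-failure.policy", "NEVER");
// Or disable the datanode-replacement feature entirely.
conf.set("dfs.client.block.write.replace-datanode-on-failure.enable", "false");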

Here's how I try to append:

import java.io.OutputStream;
import java.io.PrintWriter;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
conf.set("fs.defaultFS", "hdfs://MY-MACHINE:8020/user/hadoop");
conf.set("hadoop.job.ugi", "hadoop");

// Open the existing file in append mode and write to it.
FileSystem fs = FileSystem.get(conf);
OutputStream out = fs.append(new Path("/user/hadoop/test"));

PrintWriter writer = new PrintWriter(out);
writer.print("hello world");
writer.close();

Is there something I am doing wrong in the code? Maybe there is something missing in the configuration? Any help will be appreciated!

EDIT

Even though dfs.replication is set to 1, when I check the status of the file through

FileStatus[] status = fs.listStatus(new Path("/user/hadoop"));

I find that status[i].block_replication is set to 3. I don't think that this is the problem, because when I changed the value of dfs.replication to 0 I got a relevant exception. So apparently it does indeed obey the value of dfs.replication, but to be on the safe side, is there a way to change the block_replication value per file?
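
For reference, this is roughly how the per-file replication factor can be read through the public FileStatus API (a sketch, assuming fs is the FileSystem instance from the snippet above; getReplication() is the accessor behind the block_replication value mentioned here):

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;

// Sketch: print the replication factor reported for each file in the directory.
FileStatus[] status = fs.listStatus(new Path("/user/hadoop"));
for (FileStatus s : status) {
    System.out.println(s.getPath() + " -> replication " + s.getReplication());
}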

Asked Jul 03 '14 by MoustafaAAtta



2 Answers

As I mentioned in the edit, even though dfs.replication is set to 1, fileStatus.block_replication is reported as 3.

A possible solution is to run

hadoop fs -setrep -w 1 -R /user/hadoop/

This recursively changes the replication factor of every file under the given directory. The -setrep command is documented in the Hadoop file system shell guide.
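
If you would rather do it from Java, the FileSystem API exposes a per-file equivalent (a sketch; like -setrep, setReplication() only affects files that already exist):

import org.apache.hadoop.fs.Path;

// Sketch: lower the replication factor of a single existing file to 1
// (fs is an already configured org.apache.hadoop.fs.FileSystem instance).
boolean changed = fs.setReplication(new Path("/user/hadoop/test"), (short) 1);
System.out.println("Replication changed: " + changed);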

What remains is to find out why the value in hdfs-site.xml is being ignored, and how to make 1 the default.

EDIT

It turns out that the dfs.replication property has to be set on the Configuration instance too; otherwise the client requests the default replication factor for the file, which is 3, regardless of the value set in hdfs-site.xml.

Adding the following statement to the code solves it:

conf.set("dfs.replication", "1");
Answered by MoustafaAAtta


I also faced the same exception you initially posted, and I solved the problem thanks to your comments (setting dfs.replication to 1).

But I don't understand something: what happens if I do have replication? Is it then not possible to append to a file?

I would appreciate your answer, and any experience you have had with this.

Thanks

Answered by user1002065