
Which HDFS operations are atomic?

I am trying to write code to import files into HDFS for use as a hive external table. I have found that using something like:

foo | ssh hostname "hdfs dfs -put - /destination/$FILENAME"

can cause problems, because the file is first written under a temporary name and then renamed once the copy completes. This opens a race condition in Hive: the temporary file can appear in the directory listing and then be gone, renamed, by the time the query executes.

One workaround is to copy to a temporary directory and then "hdfs dfs -mv" the file into position.

The specific and general/academic questions are:

  1. The "hdfs dfs -mv" command is atomic, right?
  2. What other HDFS commands or operations are atomic?
  3. Can two "hdfs dfs -mkdir" commands issued at approximately the same time believe they both succeeded?
  4. Is there a better way to avoid race conditions with Hive when moving files into position?
Setjmp asked Sep 03 '13 05:09


1 Answer

The Hadoop FS introduction lists the requirements for atomicity:

Here are the core expectations of a Hadoop-compatible FileSystem. Some FileSystems do not meet all these expectations; as a result, some programs may not work as expected.

Atomicity

There are some operations that MUST be atomic. This is because they are often used to implement locking/exclusive access between processes in a cluster.

  1. Creating a file. If the overwrite parameter is false, the check and creation MUST be atomic.
  2. Deleting a file.
  3. Renaming a file.
  4. Renaming a directory.
  5. Creating a single directory with mkdir().

...

Most other operations come with no requirements or guarantees of atomicity.

So to be sure, you must check the underlying filesystem. Based on those requirements, though, the answers are:

  1. Yes.
  2. The operations listed above.
  3. No.
  4. IMHO, renaming a file is a good choice for the job.
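
As a rough illustration of point 4, the stage-then-rename approach might look like the sketch below. The function name stage_and_publish and the example paths are made up for illustration. It assumes the staging directory is on the same HDFS filesystem but outside any location Hive lists, and it relies on the underlying filesystem actually meeting the rename requirement quoted above.

```shell
#!/bin/sh
# Sketch only: stage an upload outside the table directory, then rename it
# into place. stage_and_publish and all paths here are hypothetical names.
#
# Usage: producer | stage_and_publish <staging_dir> <dest_dir> <filename>
stage_and_publish() {
    staging_dir=$1
    dest_dir=$2
    filename=$3

    # Stream stdin into the staging directory. The temporary file that
    # -put creates while copying only ever appears here, not in the table.
    hdfs dfs -put - "$staging_dir/$filename" || return 1

    # Move the finished file into the table directory with a single rename.
    # This runs only if the upload above succeeded.
    hdfs dfs -mv "$staging_dir/$filename" "$dest_dir/$filename"
}

# Example call (hypothetical paths), mirroring the command in the question:
#   foo | ssh hostname "hdfs dfs -put - /staging/$FILENAME && \
#                       hdfs dfs -mv /staging/$FILENAME /destination/$FILENAME"
```

The function returns a non-zero status if either step fails, so a caller can retry or clean up the staged file without anything partial ever showing up in the destination directory.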
tworec answered Sep 28 '22 04:09