I am trying to write code to import files into HDFS for use as a Hive external table. I have found that using something like:
foo | ssh hostname "hdfs dfs -put - /destination/$FILENAME"
can be problematic: -put writes the stream to a temporary file (the destination name with a ._COPYING_ suffix) and renames it when the copy completes. This creates a race condition for Hive between a directory listing and query execution.
One workaround is to copy into a temporary directory and "hdfs dfs -mv" the file into position.
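The workaround above can be sketched as a small shell helper. This is only an illustration: the function name staged_put and the paths in the example are invented, not part of any Hadoop tooling.

```shell
#!/bin/sh
set -e
# Staged upload: write into a directory Hive never lists, then rename the
# finished file into the table directory. Renaming a file is one of the
# operations the Hadoop FS spec requires to be atomic, so a concurrent
# query sees either no file or the complete file, never a partial one.
staged_put() {
    staging="$1"    # staging directory Hive does not read
    dest="$2"       # Hive external table directory
    name="$3"       # destination file name
    # stdin is streamed into the staging copy
    hdfs dfs -put - "$staging/$name"
    # atomic rename into the table directory
    hdfs dfs -mv "$staging/$name" "$dest/$name"
}

# Example (assumes both directories already exist):
# foo | staged_put /tmp/staging /warehouse/mytable data.csv
```

Note the rename must stay on the same HDFS filesystem; a cross-filesystem move degrades to copy-and-delete and loses the atomicity.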
The specific and general/academic questions are:
In the introduction of the Hadoop FS specification you can find the requirements for atomicity:
Here are the core expectations of a Hadoop-compatible FileSystem. Some FileSystems do not meet all these expectations; as a result, some programs may not work as expected.
Atomicity
There are some operations that MUST be atomic. This is because they are often used to implement locking/exclusive access between processes in a cluster.
- Creating a file. If the overwrite parameter is false, the check and creation MUST be atomic.
- Deleting a file.
- Renaming a file.
- Renaming a directory.
- Creating a single directory with mkdir().
...
Most other operations come with no requirements or guarantees of atomicity.
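The atomic check-and-create requirement is exactly what the spec means by "used to implement locking/exclusive access". As a hypothetical illustration (the acquire_lock/release_lock names and the lock path are invented, not a Hadoop API), a lock file only works because create-with-overwrite=false cannot race:

```shell
#!/bin/sh
set -e
# Hypothetical lock built on atomic create-with-overwrite=false.
# Without -f, "hdfs dfs -put" refuses to replace an existing file, and the
# spec requires the existence check and the creation to be one atomic step,
# so at most one process in the cluster can win the race.
acquire_lock() {
    lockfile="$1"
    echo "$$" | hdfs dfs -put - "$lockfile" 2>/dev/null
}

release_lock() {
    hdfs dfs -rm -f "$1" >/dev/null 2>&1
}

# Usage sketch:
# if acquire_lock /locks/mytable.lock; then
#     ...load data...
#     release_lock /locks/mytable.lock
# fi
```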
So to be sure, you must check the underlying filesystem. But based on those requirements, the answers are: