
Accessing a file that is being written

Tags: hadoop, hdfs

You use the hadoop fs -put command to write a 300 MB file using an HDFS block size of 64 MB. Just after this command has finished writing 200 MB of this file, what would another user see when trying to access this file?

a.) They would see Hadoop throw a ConcurrentFileAccessException when they try to access this file.
b.) They would see the current state of the file, up to the last bit written by the command.
c.) They would see the current state of the file through the last completed block.
d.) They would see no content until the whole file is written and closed.

From what I understand about the hadoop fs -put command, the answer is D; however, some say it is C.

Could anyone provide a constructive explanation for either of the options?

Thanks xx

Denys asked Oct 29 '14 15:10


3 Answers

The file will not be accessible until the whole file is written and closed (option D) because, to access a file, a client first sends a request to the NameNode to obtain the metadata describing the blocks that compose the file. The NameNode writes this metadata only after it receives confirmation that all blocks of the file were written successfully.

Therefore, even though the blocks are available, the user can't see the file until the metadata is updated, which is done after all blocks are written.

Chaos answered Sep 23 '22 10:09


As soon as a file is created, it is visible in the filesystem namespace. Any content written to the file is not guaranteed to be visible, however:

Once more than a block's worth of data has been written, the first block will be visible to new readers. This is true of subsequent blocks, too: it is always the current block being written that is not visible to other readers. (From Hadoop: The Definitive Guide, Coherency Model.)

So, I would go with Option C.
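
To put numbers on the rule quoted above for the question's 300 MB file, here is a minimal Python sketch. The function name visible_bytes is hypothetical, and the code models only the "completed blocks are visible" rule from the quote, assuming a block counts as complete as soon as it is full. It is not a description of how HDFS is implemented.

```python
MB = 1024 * 1024

def visible_bytes(bytes_written: int, block_size: int) -> int:
    """Bytes a new reader could see while the file is still being written,
    under the quoted rule: only fully written blocks are visible.
    Assumption: a block is treated as complete once it is full."""
    completed_blocks = bytes_written // block_size
    return completed_blocks * block_size

# The scenario from the question: 200 MB written so far, 64 MB blocks.
written = 200 * MB
block = 64 * MB
print(visible_bytes(written, block) // MB)  # 192, i.e. three complete 64 MB blocks
```

Under this model, the remaining 8 MB sit in the fourth block, which is still being written and so is not yet visible to other readers.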

Also, take a look at this related question.

Ashrith answered Sep 26 '22 10:09


It seems both D and C are true, as detailed by Chaos and Ashrith, respectively. I documented the results at https://martin.atlassian.net/wiki/spaces/lestermartin/blog/2019/03/21/1172373509/are+partially-written+hdfs+files+accessible+not+exactly+but+much+more+yes+than+I+previously+thought after experimenting with a 7.5 GB file.

In a nutshell: yes, the exact file name is NOT present until the copy completes, AND yes, you can actually read the file up to the last block written if you realize the filename is temporarily suffixed with ._COPYING_.
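
The rename behaviour described above can be pictured with a toy model. This Python sketch is only an illustration: the namespace is a plain dict, the put function and the path /data/big.bin are hypothetical, and nothing here talks to a real cluster. It shows why a listing finds only the ._COPYING_ name while the copy is running and the final name afterwards.

```python
COPYING_SUFFIX = "._COPYING_"

def put(namespace: dict, dest: str, chunks):
    """Toy model of a copy that writes under a temporary name.

    Yields the sorted list of visible names after each chunk is written,
    then renames the temporary entry to the destination name once all
    chunks have been consumed."""
    tmp = dest + COPYING_SUFFIX
    namespace[tmp] = b""
    for chunk in chunks:
        namespace[tmp] += chunk
        yield sorted(namespace)  # what a listing shows mid-copy
    # Copy finished: the final name appears only now.
    namespace[dest] = namespace.pop(tmp)

ns = {}
for listing in put(ns, "/data/big.bin", [b"a" * 4, b"b" * 4]):
    print(listing)   # ['/data/big.bin._COPYING_'] while copying
print(sorted(ns))    # ['/data/big.bin'] after the rename
```

In this model a reader who asks for /data/big.bin mid-copy finds nothing (option D), while a reader who knows the temporary name can reach the data written so far, which matches the observation that both answers hold.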

Lester Martin answered Sep 23 '22 10:09