You use the hadoop fs -put command to write a 300 MB file using an HDFS block size of 64 MB. Just after this command has finished writing 200 MB of this file, what would another user see when trying to access this file?
a.) They would see Hadoop throw a ConcurrentFileAccessException when they try to access this file.
b.) They would see the current state of the file, up to the last bit written by the command.
c.) They would see the current state of the file through the last completed block.
d.) They would see no content until the whole file is written and closed.
From what I understand about the hadoop fs -put command, the answer is D; however, some say it is C.
Could anyone provide a constructive explanation for either of the options?
Thanks xx
The file will not be accessible until the whole file is written and closed (option D) because, in order to access a file, the request is first sent to the NameNode to obtain metadata about the blocks that compose the file. The NameNode writes this metadata only after it receives confirmation that all blocks of the file were written successfully.
Therefore, even though the blocks are available, the user can't see the file until the metadata is updated, which happens after all blocks are written.
As soon as a file is created, it is visible in the filesystem namespace. Any content written to the file is not guaranteed to be visible, however:
Once more than a block's worth of data has been written, the first block will be visible to new readers. This is true of subsequent blocks, too: it is always the current block being written that is not visible to other readers. (From Hadoop Definitive Guide, Coherency Model).
So, I would go with Option C.
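To make the arithmetic concrete for the file in the question, here is a minimal sketch in Python. It only models the coherency rule quoted above (completed blocks are visible, the block currently being written is not) and does not talk to HDFS. The function name `visible_bytes` is made up for illustration, and the exact-block-boundary case is simplified.

```python
MB = 1024 * 1024

def visible_bytes(bytes_written, block_size):
    """Bytes a new reader could see under the quoted rule: only fully
    written blocks count; the block still in progress is excluded."""
    completed_blocks = bytes_written // block_size
    return completed_blocks * block_size

seen = visible_bytes(200 * MB, 64 * MB)
print(seen // MB)  # 192: three complete 64 MB blocks; the last 8 MB is not yet visible
```

So with 200 MB written, a reader following this rule would see the first three blocks (192 MB) and nothing of the fourth.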
Also, take a look at this related question.
It seems both D and C are true, as detailed by Chaos and Ashrith, respectively. I documented their results at https://martin.atlassian.net/wiki/spaces/lestermartin/blog/2019/03/21/1172373509/are+partially-written+hdfs+files+accessible+not+exactly+but+much+more+yes+than+I+previously+thought when playing with a 7.5 GB file.
In a nutshell: yes, the exact file name is NOT present until the write completes, AND yes, you can actually read the file up to the last block written if you realize the filename is temporarily suffixed with ._COPYING_.
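As a small sketch of where to look while the upload is still running, assuming the ._COPYING_ suffix described above: the helper name and the example path below are hypothetical, and the code only builds the path string without contacting HDFS.

```python
COPYING_SUFFIX = "._COPYING_"  # temporary suffix mentioned in this answer

def in_progress_path(target_path):
    """Name under which a file being uploaded would appear until the
    put completes (assumption: the suffix is simply appended)."""
    return target_path + COPYING_SUFFIX

print(in_progress_path("/user/alice/big.dat"))  # /user/alice/big.dat._COPYING_
```

Listing or reading that suffixed path during the upload is what shows the partially written content; the final name appears only once the copy finishes.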