Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How does HDFS with append works

Let's assume one is using default block size (128 MB), and there is a file using 130 MB ; so using one full size block and one block with 2 MB. Then 20 MB needs to be appended to the file (total should be now of 150 MB). What happens?

Does HDFS actually resize the size of the last block from 2MB to 22MB? Or create a new block?

How does appending to a file in HDFS deal with conccurency? Is there risk of dataloss ?

Does HDFS create a third block put the 20+2 MB in it, and delete the block with 2MB. If yes, how does this work concurrently?

like image 594
David Avatar asked Feb 06 '12 15:02

David


People also ask

Can we append data to HDFS file?

You can use appendToFile in your Hadoop file system command. This command appends the contents of all the given local files to the provided destination file on the HDFS filesystem. The destination file will be created if it is not existing earlier.

Is HDFS append only?

HDFS files cannot be edited and are append-only. Each file, once closed, can be opened only to append data to it. HDFS also does not guarantee that writes to a file are visible to other clients until the client writing the data flushes the data to data node memory, or closes the file.

How trunc () is different from append ()?

Append allows to add new data at the end of file while truncate to cut some last characters in file. Both are different logic: append is much simpler since it deals mostly with file length. Truncate in the other side must take into account such aspects as not full last block or truncated block referenced in snapshots.

Can files in HDFS be modified?

You can not modified data once stored in hdfs because hdfs follows Write Once Read Many model. You can only append the data once stored in hdfs.


1 Answers

According to the latest design document in the Jira issue mentioned before, we find the following answers to your question:

  1. HDFS will append to the last block, not create a new block and copy the data from the old last block. This is not difficult because HDFS just uses a normal filesystem to write these block-files as normal files. Normal file systems have mechanisms for appending new data. Of course, if you fill up the last block, you will create a new block.
  2. Only one single write or append to any file is allowed at the same time in HDFS, so there is no concurrency to handle. This is managed by the namenode. You need to close a file if you want someone else to begin writing to it.
  3. If the last block in a file is not replicated, the append will fail. The append is written to a single replica, who pipelines it to the replicas, similar to a normal write. It seems to me like there is no extra risk of dataloss as compared to a normal write.
like image 197
EthanP Avatar answered Jan 02 '23 23:01

EthanP