Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Hadoop (HDFS) - file versioning

Tags:

At the given time I have user file system in my application (apache CMIS). As it's growing bigger, I'm doubting to move to hadoop (HDFS) as we need to run some statistics on it as well. The problem: The current file system provides versioning of the files. When I read about hadoop - HDFS- and file versioning, I found most of the time that I have to write this (versioning) layer myself. Is there already something available to manage versioning of files in HDFS or do I really have to write it myself (don't want to reinvent the hot water, but don't find a proper solution either).

Answer

For full details: see comments on answer(s) below

Hadoop (HDFS) doesn't support versioning of files. You can get this functionality when you combine hadoop with (amazon) S3: Hadoop will use S3 as the filesystem (without chuncks, but recovery will be provided by S3). This solution comes with the versioning of files that S3 provides. Hadoop will still use YARN for the distributed processing.

like image 817
Vandeperre Maarten Avatar asked Mar 13 '17 09:03

Vandeperre Maarten


People also ask

What is Hadoop version file?

Namenode directory structure This mechanism provides resilience, particularly if one of the directories is an NFS mount, as is recommended. The VERSION file is a Java properties file that contains information about the version of HDFS that is running.

Can you modify an existing file in HDFS?

Your answerYou can not modified data once stored in hdfs because hdfs follows Write Once Read Many model. You can only append the data once stored in hdfs.

How files are stored in HDFS?

How Does HDFS Store Data? HDFS divides files into blocks and stores each block on a DataNode. Multiple DataNodes are linked to the master node in the cluster, the NameNode. The master node distributes replicas of these data blocks across the cluster.

How does the HDFS architecture provide data reliability?

HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file.


1 Answers

Versioning is not possible with HDFS.
Instead you can use Amazon S3, which provides Versioning and is also compatible with Hadoop.

like image 82
franklinsijo Avatar answered Sep 25 '22 10:09

franklinsijo