
HDFS vs LFS - How is the Hadoop Distributed File System built over the local file system?

Tags:

hadoop

From various blog posts I have read, I understand that HDFS is another layer that exists on top of the local file system of a computer.

I have also installed Hadoop, but I have trouble understanding the existence of the HDFS layer over the local file system.

Here is my question:

Consider that I am installing Hadoop in pseudo-distributed mode. What happens under the hood during this installation? I added a hadoop.tmp.dir parameter in the configuration files. Is it the single folder that the namenode daemon talks to when it attempts to access the datanode?
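For reference, in pseudo-distributed mode that property normally goes into core-site.xml. A minimal sketch, using the Hadoop 1.x-era property names; both values below are placeholder assumptions, not values from the question:

    <?xml version="1.0"?>
    <!-- core-site.xml (assumed example values) -->
    <configuration>
      <property>
        <!-- base directory under which HDFS keeps its files by default -->
        <name>hadoop.tmp.dir</name>
        <value>/app/hadoop/tmp</value>
      </property>
      <property>
        <!-- URI of the namenode that clients and datanodes talk to -->
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000</value>
      </property>
    </configuration>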

asked May 29 '13 by data_set

1 Answer

OK, let me give it a try. When you configure Hadoop, it lays down a virtual FS on top of your local FS, which is HDFS. HDFS stores data as blocks (similar to the local FS, but much bigger: typically 64 MB or 128 MB per block versus a few KB) in a replicated fashion. But the HDFS directory tree, i.e. the filesystem namespace, looks just like that of a local FS. When you start writing data into HDFS, it eventually gets written onto the local FS only, but you can't see it there directly.
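To make that concrete, here is a rough sketch of what you could try on a pseudo-distributed node; it assumes the block storage directory is /app/hadoop/tmp/dfs/data, an illustrative value matching the defaults described below:

    # copy a local file into HDFS
    hadoop fs -put /home/user/sample.txt /sample.txt

    # on the local FS the same bytes now live as opaque block files
    # (named like blk_<id>, plus blk_<id>_<genstamp>.meta checksum files)
    # managed by the datanode -- there is no sample.txt to browse here
    ls /app/hadoop/tmp/dfs/data/current/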

The temp directory (hadoop.tmp.dir) actually serves three purposes (a sample config follows this list):

1. Directory where the namenode stores its metadata, with default value ${hadoop.tmp.dir}/dfs/name; it can be specified explicitly by dfs.name.dir. If you specify dfs.name.dir, then the namenode metadata will be stored in the directory given as the value of this property.

2. Directory where HDFS data blocks are stored, with default value ${hadoop.tmp.dir}/dfs/data; it can be specified explicitly by dfs.data.dir. If you specify dfs.data.dir, then the HDFS data will be stored in the directory given as the value of this property.

3. Directory where the secondary namenode stores its checkpoints, with default value ${hadoop.tmp.dir}/dfs/namesecondary; it can be specified explicitly by fs.checkpoint.dir.
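For illustration, a minimal hdfs-site.xml overriding all three could look like the sketch below. Only the property names come from the list above; the /data/hadoop/... paths are hypothetical. (fs.checkpoint.dir conventionally lives in core-site.xml, but the HDFS daemons load both site files into the same configuration, so it works here too.)

    <?xml version="1.0"?>
    <!-- hdfs-site.xml (hypothetical dedicated locations) -->
    <configuration>
      <property>
        <name>dfs.name.dir</name>        <!-- namenode metadata -->
        <value>/data/hadoop/name</value>
      </property>
      <property>
        <name>dfs.data.dir</name>        <!-- HDFS data blocks -->
        <value>/data/hadoop/data</value>
      </property>
      <property>
        <name>fs.checkpoint.dir</name>   <!-- secondary namenode checkpoints -->
        <value>/data/hadoop/namesecondary</value>
      </property>
    </configuration>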

So, it's always better to use proper dedicated locations as the values of these properties for a cleaner setup.

When access to a particular block of data is required, the metadata stored in the dfs.name.dir directory is searched, and the location of that block on a particular datanode is returned to the client (the block itself lives somewhere under the dfs.data.dir directory on that datanode's local FS). The client then reads the data directly from there (the same holds good for writes as well).
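You can watch this mapping yourself with fsck, which asks the namenode to list the blocks of a file and the datanodes holding them (/sample.txt is the assumed example file from above):

    # show the blocks of the file and which datanodes store them
    hadoop fsck /sample.txt -files -blocks -locations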

One important point to note here is that HDFS is not a physical FS. Rather, it is a virtual abstraction on top of your local FS that can't be browsed like the local FS. You need to use the HDFS shell, the HDFS webUI, or the available APIs to do that.
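For example (reusing the assumed paths from above; the webUI port is the Hadoop 1.x default):

    # list the HDFS namespace -- note this is not the local FS root
    hadoop fs -ls /

    # read a file back out of HDFS
    hadoop fs -cat /sample.txt

    # the namenode webUI offers the same browsing view at
    # http://localhost:50070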

HTH

answered Nov 15 '22 by Tariq