I am trying to understand where Hadoop stores data in HDFS. I referred to the config files, viz. core-site.xml and hdfs-site.xml. The properties I have set are:
In core-site.xml:
<property>
<name>hadoop.tmp.dir</name>
<value>/hadoop/tmp</value>
</property>
In hdfs-site.xml:
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/hadoop/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/hadoop/hdfs/datanode</value>
</property>
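As an aside, these *-site.xml files are plain XML and can be read back with any XML tool. A minimal sketch in Python (the hdfs-site.xml fragment is inlined here for illustration, with the `<configuration>` wrapper that real files have):

```python
import xml.etree.ElementTree as ET

# The hdfs-site.xml fragment from above, inlined for illustration.
HDFS_SITE = """
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/hadoop/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/hadoop/hdfs/datanode</value>
  </property>
</configuration>
"""

def parse_props(xml_text):
    """Return a dict of <name> -> <value> from a Hadoop *-site.xml body."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value")
            for p in root.iter("property")}

props = parse_props(HDFS_SITE)
print(props["dfs.datanode.data.dir"])  # file:/hadoop/hdfs/datanode
```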
With the above arrangement, the data blocks should be stored in the directory given by dfs.datanode.data.dir. Is this correct?
I referred to the Apache Hadoop documentation, and from it I see this:
core-default.xml: hadoop.tmp.dir --> A base for other temporary directories.
hdfs-default.xml: dfs.datanode.data.dir --> Determines where on the local filesystem a DFS data node should store its blocks.
The default value for this property is -> file://${hadoop.tmp.dir}/dfs/data
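To make that default concrete, here is a small sketch (plain Python, not Hadoop code) of how the ${hadoop.tmp.dir} reference in the default expands, and how an explicitly configured value takes precedence over the default:

```python
import re

# Values from the question's config files.
conf = {
    "hadoop.tmp.dir": "/hadoop/tmp",
    "dfs.datanode.data.dir": "file:/hadoop/hdfs/datanode",  # explicit override
}
DEFAULT_DATA_DIR = "file://${hadoop.tmp.dir}/dfs/data"  # from hdfs-default.xml

def resolve(conf, key, default):
    """Return the configured value if present, else the default with
    ${var} references expanded -- a simplified model of Hadoop's
    Configuration variable substitution."""
    raw = conf.get(key, default)
    return re.sub(r"\$\{([^}]+)\}", lambda m: conf[m.group(1)], raw)

# Explicit value wins, used verbatim; no /dfs/data suffix is appended.
print(resolve(conf, "dfs.datanode.data.dir", DEFAULT_DATA_DIR))
# -> file:/hadoop/hdfs/datanode

# If the property were NOT set, the default would expand like this:
print(resolve({"hadoop.tmp.dir": "/hadoop/tmp"},
              "dfs.datanode.data.dir", DEFAULT_DATA_DIR))
# -> file:///hadoop/tmp/dfs/data
```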
Since I explicitly provided the value for dfs.datanode.data.dir (in hdfs-site.xml), does that mean data will be stored in that location? If so, would dfs/data be appended to ${dfs.datanode.data.dir}, specifically would it become /hadoop/hdfs/datanode/dfs/data?
However, I didn't see this directory structure being created.
One observation from my environment: after I run some MapReduce programs, the directory /hadoop/tmp/dfs/data is created.
So I am not sure whether data gets stored in the directory suggested by the property dfs.datanode.data.dir.
Has anyone had a similar experience?
HDFS has a primary NameNode, which keeps track of where file data is kept in the cluster. HDFS also has multiple DataNodes running on commodity hardware, typically one per node in the cluster. The DataNodes are generally organized within the same rack in the data center.
The DataNode stores HDFS data in files in its local file system. The DataNode has no knowledge about HDFS files. It stores each block of HDFS data in a separate file in its local file system.
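As an illustration of that block-per-file layout, the paths a DataNode creates under dfs.datanode.data.dir typically look like the sketch below. The block-pool ID, block ID, and generation stamp here are made-up examples, not values from any real cluster:

```python
import os.path

# Hypothetical identifiers -- real values come from the cluster.
DATA_DIR = "/hadoop/hdfs/datanode"
BLOCK_POOL = "BP-1234567890-10.0.0.1-1700000000000"  # example only
BLOCK_ID = 1073741825                                # example only

# A finalized block is stored as a pair of local files: the block data
# itself and a small .meta file holding its checksums.
block_dir = os.path.join(DATA_DIR, "current", BLOCK_POOL,
                         "current", "finalized", "subdir0", "subdir0")
block_file = os.path.join(block_dir, f"blk_{BLOCK_ID}")
meta_file = block_file + "_1001.meta"  # suffix is the generation stamp

print(block_file)
print(meta_file)
```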
The default setting is ${hadoop.tmp.dir}/dfs/data, and note that ${hadoop.tmp.dir} is itself configured in core-site.xml (in your case /hadoop/tmp), which matches the /hadoop/tmp/dfs/data directory you observed.
Hadoop has a file system that is much like the one on your desktop computer, but it distributes files across many machines. HDFS splits each file into large blocks and stores replicas of those blocks across the DataNodes; processing frameworks such as MapReduce then read and analyze the data stored in HDFS.
The data for HDFS files will be stored in the directory specified in dfs.datanode.data.dir, and the /dfs/data suffix that you see in the default value will not be appended.
If you edit hdfs-site.xml, you'll have to restart the DataNode service for the change to take effect. Also remember that changing the value will prevent the DataNode service from supplying blocks that were stored in the previous location.
Lastly, note that your values are specified with file:/... instead of file://.... File URIs do need that extra slash, so the malformed URIs might be causing these values to revert to the defaults.