I'm writing code to create a temporary Hadoop cluster. Unlike most Hadoop clusters, I need the location for logs, HDFS files, etc. to be in a specific temporary network location that is different each time the cluster is started. This network directory will be generated at runtime; I do not know the directory name at the time I'm checking in the shell scripts like hadoop-env.sh and the XML files like core-default.xml. I can instruct most of Hadoop to use this temporary directory by specifying environment variables like HADOOP_LOG_DIR and HADOOP_PID_DIR, and if necessary I can modify the shell scripts to read those environment variables.
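For the pieces that do honor environment variables, my startup script does something like the following minimal sketch (the CLUSTER_TMP variable and the /mnt/shared mount point are placeholders for however the network directory actually gets generated):

```sh
# Sketch: CLUSTER_TMP stands in for the runtime-generated network
# directory; /mnt/shared is an assumed mount point, not a real path.
CLUSTER_TMP="$(mktemp -d /mnt/shared/hadoop-cluster.XXXXXX)"

# hadoop-env.sh and the daemon scripts pick these up.
export HADOOP_LOG_DIR="$CLUSTER_TMP/logs"
export HADOOP_PID_DIR="$CLUSTER_TMP/pids"
```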
However, HDFS determines the local directories in which it stores the filesystem via two properties that are defined in XML files, not in environment variables or shell scripts: hadoop.tmp.dir in core-default.xml and dfs.datanode.data.dir in hdfs-default.xml.
Is there any way to write these XML files so that the value of hadoop.tmp.dir is determined at runtime? Or, alternatively, is there any way to use environment variables to override the XML-configured value of hadoop.tmp.dir?
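(For context, the workaround I'm hoping to avoid is regenerating the XML from a template at startup. A minimal sketch, assuming GNU envsubst and a hypothetical core-site.xml.template whose hadoop.tmp.dir value is a ${CLUSTER_TMP} placeholder:)

```sh
# Sketch of the template-substitution fallback I'd rather not maintain.
# core-site.xml.template is a hypothetical file containing, e.g.:
#   <property>
#     <name>hadoop.tmp.dir</name>
#     <value>${CLUSTER_TMP}</value>
#   </property>
export CLUSTER_TMP="/net/scratch/cluster-$$"   # assumed runtime path
envsubst '${CLUSTER_TMP}' \
  < core-site.xml.template \
  > "$HADOOP_CONF_DIR/core-site.xml"
```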
We had a similar requirement earlier. Configuring dfs.data.dir and dfs.name.dir through HADOOP_OPTS worked well for us. For example:

```sh
export HADOOP_OPTS="-Ddfs.name.dir=$NAMENODE_DATA -Ddfs.data.dir=$DFS_DATA"
```

The same approach can be used to set other configuration properties, such as the NameNode URL.
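This works because Hadoop's Configuration expands ${var} references in property values by consulting JVM system properties before the configuration itself, so defaults such as ${hadoop.tmp.dir}/dfs/data pick up a -Dhadoop.tmp.dir passed through HADOOP_OPTS. Putting it together, a startup wrapper might look like the sketch below; the mount point and helper variables are assumed examples, and on Hadoop 2+ the equivalent properties are dfs.namenode.name.dir and dfs.datanode.data.dir. Note that the multi-node start scripts launch daemons over ssh, so the variables must be visible in each daemon's environment (e.g. exported from hadoop-env.sh).

```sh
# Sketch of a startup wrapper; paths and variable names are assumed
# examples, not part of a stock Hadoop install.
CLUSTER_TMP="$(mktemp -d /mnt/shared/hadoop.XXXXXX)"
NAMENODE_DATA="$CLUSTER_TMP/dfs/name"
DFS_DATA="$CLUSTER_TMP/dfs/data"

export HADOOP_LOG_DIR="$CLUSTER_TMP/logs"
export HADOOP_PID_DIR="$CLUSTER_TMP/pids"
export HADOOP_OPTS="-Ddfs.name.dir=$NAMENODE_DATA -Ddfs.data.dir=$DFS_DATA -Dhadoop.tmp.dir=$CLUSTER_TMP"

"$HADOOP_HOME/bin/hadoop" namenode -format   # format the fresh directory
"$HADOOP_HOME/bin/start-dfs.sh"              # daemons inherit HADOOP_OPTS
```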