
Is it possible to run Hadoop in Pseudo-Distributed operation without HDFS?

I'm exploring the options for running a Hadoop application on a local system.

As with many applications, the first few releases should be able to run on a single node, as long as we can use all the available CPU cores (yes, this is related to this question). The current limitation is that our production systems run Java 1.5, so we are bound to Hadoop 0.18.3 as the latest release we can use (see this question). Unfortunately we can't use this new feature yet.

The first option is simply to run Hadoop in pseudo-distributed mode: essentially, create a complete Hadoop cluster with everything running on exactly one node.

The "downside" of this form is that it also uses a full fledged HDFS. This means that in order to process the input data this must first be "uploaded" onto the DFS ... which is locally stored. So this takes additional transfer time of both the input and output data and uses additional disk space. I would like to avoid both of these while we stay on a single node configuration.

So I was thinking: is it possible to override the "fs.hdfs.impl" setting and change it from "org.apache.hadoop.dfs.DistributedFileSystem" to (for example) "org.apache.hadoop.fs.LocalFileSystem"?

If this works, the "local" Hadoop cluster (which can ONLY consist of ONE node) can use existing files without any additional storage requirements, and it can start more quickly because there is no need to upload the files. I would expect to still have a job tracker and task tracker, and perhaps also a namenode to control the whole thing.

Has anyone tried this before? Can it work, or is this idea too far from the intended use?

Or is there a better way of getting the same effect: Pseudo-Distributed operation without HDFS?

Thanks for your insights.


EDIT 2:

This is the config I created for Hadoop 0.18.3 in conf/hadoop-site.xml, using the answer provided by bajafresh4life.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>file:///</value>
  </property>

  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:33301</value>
  </property>

  <property>
    <name>mapred.job.tracker.http.address</name>
    <value>localhost:33302</value>
    <description>
    The job tracker http server address and port the server will listen on.
    If the port is 0 then the server will start on a free port.
    </description>
  </property>

  <property>
    <name>mapred.task.tracker.http.address</name>
    <value>localhost:33303</value>
    <description>
    The task tracker http server address and port.
    If the port is 0 then the server will start on a free port.
    </description>
  </property>

</configuration>
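
With fs.default.name set to file:///, a quick sanity check (a sketch, using the dfs shell from this Hadoop generation; the path is just an arbitrary local example) is to list an ordinary local directory through Hadoop, which should show up without any upload step:

# Resolves against the local filesystem, not HDFS, because fs.default.name is file:///
bin/hadoop dfs -ls /local/data
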
Asked Aug 23 '10 by Niels Basjes


1 Answer

Yes, this is possible, although I'm using 0.19.2. I'm not too familiar with 0.18.3, but I'm pretty sure it shouldn't make a difference.

Just make sure that fs.default.name is set to the default (which is file:///) and that mapred.job.tracker points to wherever your jobtracker is hosted. Then start up your daemons using bin/start-mapred.sh. You don't need to start the namenode or datanodes. At this point you should be able to run your map/reduce jobs using bin/hadoop jar ...
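
In concrete terms the launch then looks roughly like this; the jar name, main class and paths are placeholders for illustration, not anything from the question:

# Only the MapReduce daemons are needed; bin/start-dfs.sh can be skipped entirely
bin/start-mapred.sh

# With fs.default.name=file:/// the input and output are plain local paths
bin/hadoop jar my-job.jar com.example.MyJob /local/input /local/output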

We've used this configuration to run Hadoop over a small cluster of machines using a Netapp appliance mounted over NFS.

Answered Sep 20 '22 by bajafresh4life