Using Hadoop through a SOCKS proxy?

Tags: proxy, hadoop

So our Hadoop cluster runs on a set of nodes and can only be accessed from those nodes. You SSH into one of them and do your work there.

That is quite annoying, but (understandably) nobody will go near configuring access control so that the cluster becomes usable from outside for some users. So I'm trying the next best thing: using SSH to run a SOCKS proxy into the cluster:

$ ssh -D localhost:10000 the.gateway cat
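(The trailing cat is only there to keep the session from exiting; ssh's -N flag, which skips running a remote command altogether, should do the same thing:)

$ ssh -N -D localhost:10000 the.gateway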

There are whispers of SOCKS support (naturally I haven't found any documentation), and apparently the settings go into core-site.xml:

<property>
  <name>fs.default.name</name>
  <value>hdfs://reachable.from.behind.proxy:1234/</value>
</property>
<property>
  <name>mapred.job.tracker</name>
  <value>reachable.from.behind.proxy:5678</value>
</property>
<property>
  <name>hadoop.rpc.socket.factory.class.default</name>
  <value>org.apache.hadoop.net.SocksSocketFactory</value>
</property>
<property>
  <name>hadoop.socks.server</name>
  <value>localhost:10000</value>
</property>

Except hadoop fs -ls / still fails, without any mention of SOCKS.
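(Even with client-side debug logging turned on — assuming the launcher script still honors the HADOOP_ROOT_LOGGER environment variable — I see no SOCKS-related output:)

$ HADOOP_ROOT_LOGGER=DEBUG,console hadoop fs -ls /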

Any tips?


I'm only trying to run jobs, not administer the cluster. I only need to access HDFS and submit jobs, through SOCKS. (There seems to be an entirely separate topic about using SSL/proxies between the cluster nodes themselves; I don't want that. My machine shouldn't be part of the cluster, just a client.)

Is there any useful documentation on this? To illustrate my failure to turn up anything useful: I found the configuration values above by running the hadoop client under strace -f and checking which configuration files it read.

Is there a description anywhere of which configuration values the client even reacts to? (I have found literally zero reference documentation, just variously outdated tutorials; I hope I've been missing something?)

Is there a way to dump the configuration values it is actually using?
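(One trick that might work, assuming org.apache.hadoop.conf.Configuration still has a main method that dumps the merged configuration as XML:)

$ hadoop org.apache.hadoop.conf.Configuration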

asked Aug 01 '14 by pascal


1 Answer

The original code to implement this was added under https://issues.apache.org/jira/browse/HADOOP-1822

This blog post also notes that you have to change the default socket factory class to the SOCKS one:

http://rainerpeter.wordpress.com/2014/02/12/connect-to-hdfs-using-a-proxy/

with

<property>
  <name>hadoop.rpc.socket.factory.class.default</name>
  <value>org.apache.hadoop.net.SocksSocketFactory</value>
</property>

Edit: Note that the properties go in different files (see the combined sketch below):

  1. fs.default.name, hadoop.socks.server and hadoop.rpc.socket.factory.class.default need to go into core-site.xml
  2. mapred.job.tracker and mapred.job.tracker.http.address need to go into mapred-site.xml (the map-reduce configuration)
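Putting it together, a minimal sketch of the two files (host names and ports are the placeholders from the question):

core-site.xml:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://reachable.from.behind.proxy:1234/</value>
  </property>
  <property>
    <name>hadoop.socks.server</name>
    <value>localhost:10000</value>
  </property>
  <property>
    <name>hadoop.rpc.socket.factory.class.default</name>
    <value>org.apache.hadoop.net.SocksSocketFactory</value>
  </property>
</configuration>

mapred-site.xml:

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>reachable.from.behind.proxy:5678</value>
  </property>
</configuration>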
answered Sep 25 '22 by dajobe