
Accessing HDFS from docker-hadoop-spark-workbench via Zeppelin

I have installed https://github.com/big-data-europe/docker-hadoop-spark-workbench

Then I started it up with docker-compose up. I navigated to the various URLs mentioned in the repository's README and everything appears to be up.

I then started a local Apache Zeppelin with:

./bin/zeppelin.sh start

In Zeppelin's interpreter settings I then navigated to the Spark interpreter and updated master to point to the local cluster installed with Docker:

master: updated from local[*] to spark://localhost:8080

I then ran the following code in a notebook:

import org.apache.hadoop.fs.{FileSystem, Path}

FileSystem.get(sc.hadoopConfiguration).listStatus(new Path("hdfs:///")).foreach(x => println(x.getPath))

I get this exception in the Zeppelin logs:

 INFO [2017-12-15 18:06:35,704] ({pool-2-thread-2} Paragraph.java[jobRun]:362) - run paragraph 20171212-200101_1553252595 using null org.apache.zeppelin.interpreter.LazyOpenInterpreter@32d09a20
 WARN [2017-12-15 18:07:37,717] ({pool-2-thread-2} NotebookServer.java[afterStatusChange]:2064) - Job 20171212-200101_1553252595 is finished, status: ERROR, exception: null, result: %text java.lang.NullPointerException
    at org.apache.zeppelin.spark.Utils.invokeMethod(Utils.java:38)
    at org.apache.zeppelin.spark.Utils.invokeMethod(Utils.java:33)
    at org.apache.zeppelin.spark.SparkInterpreter.createSparkContext_2(SparkInterpreter.java:398)
    at org.apache.zeppelin.spark.SparkInterpreter.createSparkContext(SparkInterpreter.java:387)
    at org.apache.zeppelin.spark.SparkInterpreter.getSparkContext(SparkInterpreter.java:146)
    at org.apache.zeppelin.spark.SparkInterpreter.open(SparkInterpreter.java:843)
    at org.apache.zeppelin.interpreter.LazyOpenInterpreter.open(LazyOpenInterpreter.java:70)
    at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:491)
    at org.apache.zeppelin.scheduler.Job.run(Job.java:175)
    at org.apache.zeppelin.scheduler.FIFOScheduler$1.run(FIFOScheduler.java:139)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)

How can I access HDFS from Zeppelin and Java/Spark code?

Asked Dec 15 '17 by Jas

1 Answer

The reason for the exception is that the sparkSession object is null in Zeppelin. The stack trace is consistent with this: createSparkContext_2 invokes sparkContext on that null sparkSession during interpreter startup, so the NullPointerException is thrown before your notebook code ever runs.

Reference: https://github.com/apache/zeppelin/blob/master/spark/src/main/java/org/apache/zeppelin/spark/SparkInterpreter.java

// From SparkInterpreter: the NPE surfaces here when sparkSession was
// never initialized, i.e. interpreter startup failed before a session existed.
private SparkContext createSparkContext_2() {
    return (SparkContext) Utils.invokeMethod(sparkSession, "sparkContext");
}

This is most likely a configuration issue. Cross-verify the interpreter settings against the Spark cluster's settings, and make sure Spark itself is working before involving Zeppelin.
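As a concrete first check, here is a minimal sketch assuming a standalone cluster: port 8080 normally serves the master's web UI, while the master itself accepts connections on 7077. Verify that a plain SparkContext can reach the master outside Zeppelin:

import org.apache.spark.{SparkConf, SparkContext}

// spark://localhost:7077 is the usual standalone master RPC port;
// spark://localhost:8080 is only the web UI and will not accept jobs.
val conf = new SparkConf()
  .setAppName("connectivity-check")
  .setMaster("spark://localhost:7077")
val sc = new SparkContext(conf)
println(sc.parallelize(1 to 10).sum())  // prints 55.0 if executors can run
sc.stop()

If this fails too, fix the master URL and cluster health first; the NullPointerException in Zeppelin is only the downstream symptom.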

Reference: https://zeppelin.apache.org/docs/latest/interpreter/spark.html
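If Spark comes up but HDFS still cannot be listed, it helps to separate the two concerns. Here is a sketch that targets the namenode directly instead of relying on fs.defaultFS; hdfs://localhost:8020 is an assumption, so use whichever namenode RPC port the workbench's docker-compose file actually maps to the host:

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// hdfs://localhost:8020 is an assumed namenode address; substitute the
// namenode RPC port published by the workbench's docker-compose file.
val fs = FileSystem.get(new URI("hdfs://localhost:8020"), new Configuration())
fs.listStatus(new Path("/")).foreach(status => println(status.getPath))

If this listing works while hdfs:/// does not, the problem is the fs.defaultFS setting seen by the interpreter rather than the cluster itself.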

Hope this helps.

Answered Sep 21 '22 by Marco99