Weird behaviour with spark-submit

I am running the following code in pyspark:

In [14]: conf = SparkConf()

In [15]: conf.getAll()

[(u'spark.eventLog.enabled', u'true'),
 (u'spark.eventLog.dir',
  u'hdfs://ip-10-0-0-220.ec2.internal:8020/user/spark/applicationHistory'),
 (u'spark.master', u'local[*]'),
 (u'spark.yarn.historyServer.address',
  u'http://ip-10-0-0-220.ec2.internal:18088'),
 (u'spark.executor.extraLibraryPath',
  u'/opt/cloudera/parcels/CDH-5.3.3-1.cdh5.3.3.p0.5/lib/hadoop/lib/native'),
 (u'spark.app.name', u'pyspark-shell'),
 (u'spark.driver.extraLibraryPath',
  u'/opt/cloudera/parcels/CDH-5.3.3-1.cdh5.3.3.p0.5/lib/hadoop/lib/native')]

In [16]: sc

<pyspark.context.SparkContext at 0x7fab9dd8a750>

In [17]: sc.version

u'1.4.0'

In [19]: sqlContext

<pyspark.sql.context.HiveContext at 0x7fab9de785d0>

In [20]: access = sqlContext.read.json("hdfs://10.0.0.220/raw/logs/arquimedes/access/*.json")

And everything runs smoothly (I can create tables in the Hive Metastore, etc.)

But when I try to run this code with spark-submit:

# -*- coding: utf-8 -*-

from __future__ import print_function

import re

from pyspark import SparkContext
from pyspark.sql import HiveContext
from pyspark.sql import Row
from pyspark.conf import SparkConf

if __name__ == "__main__":

    sc = SparkContext(appName="Minimal Example 2")

    conf = SparkConf()

    print(conf.getAll())

    print(sc)

    print(sc.version)

    sqlContext = HiveContext(sc)

    print(sqlContext)

    # ## Read the access log file
    access = sqlContext.read.json("hdfs://10.0.0.220/raw/logs/arquimedes/access/*.json")

    sc.stop()

I run this code with:

$ spark-submit --master yarn-cluster --deploy-mode cluster minimal-example2.py

and it apparently runs without error, but if you check the logs:

$ yarn logs -applicationId application_1435696841856_0027       

It reads as follows:

15/07/01 16:55:10 INFO client.RMProxy: Connecting to ResourceManager at ip-10-0-0-220.ec2.internal/10.0.0.220:8032


Container: container_1435696841856_0027_01_000001 on ip-10-0-0-36.ec2.internal_8041
=====================================================================================
LogType: stderr
LogLength: 21077
Log Contents:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/yarn/nm/usercache/nanounanue/filecache/133/spark-assembly-1.4.0-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.3.3-1.cdh5.3.3.p0.5/jars/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
15/07/01 16:54:00 INFO yarn.ApplicationMaster: Registered signal handlers for [TERM, HUP, INT]
15/07/01 16:54:01 INFO yarn.ApplicationMaster: ApplicationAttemptId: appattempt_1435696841856_0027_000001
15/07/01 16:54:02 INFO spark.SecurityManager: Changing view acls to: yarn,nanounanue
15/07/01 16:54:02 INFO spark.SecurityManager: Changing modify acls to: yarn,nanounanue
15/07/01 16:54:02 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(yarn, nanounanue); users with modify permissions: Set(yarn, nanounanue)
15/07/01 16:54:02 INFO yarn.ApplicationMaster: Starting the user application in a separate Thread
15/07/01 16:54:02 INFO yarn.ApplicationMaster: Waiting for spark context initialization
15/07/01 16:54:02 INFO yarn.ApplicationMaster: Waiting for spark context initialization ... 
15/07/01 16:54:03 INFO spark.SparkContext: Running Spark version 1.4.0
15/07/01 16:54:03 INFO spark.SecurityManager: Changing view acls to: yarn,nanounanue
15/07/01 16:54:03 INFO spark.SecurityManager: Changing modify acls to: yarn,nanounanue
15/07/01 16:54:03 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(yarn, nanounanue); users with modify permissions: Set(yarn, nanounanue)
15/07/01 16:54:03 INFO slf4j.Slf4jLogger: Slf4jLogger started
15/07/01 16:54:03 INFO Remoting: Starting remoting
15/07/01 16:54:03 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://[email protected]:41190]
15/07/01 16:54:03 INFO util.Utils: Successfully started service 'sparkDriver' on port 41190.
15/07/01 16:54:04 INFO spark.SparkEnv: Registering MapOutputTracker
15/07/01 16:54:04 INFO spark.SparkEnv: Registering BlockManagerMaster
15/07/01 16:54:04 INFO storage.DiskBlockManager: Created local directory at /yarn/nm/usercache/nanounanue/appcache/application_1435696841856_0027/blockmgr-14127054-19b1-4cfe-80c3-2c5fc917c9cf
15/07/01 16:54:04 INFO storage.DiskBlockManager: Created local directory at /data0/yarn/nm/usercache/nanounanue/appcache/application_1435696841856_0027/blockmgr-c8119846-7f6f-45eb-911b-443cb4d7e9c9
15/07/01 16:54:04 INFO storage.MemoryStore: MemoryStore started with capacity 245.7 MB
15/07/01 16:54:04 INFO spark.HttpFileServer: HTTP File server directory is /yarn/nm/usercache/nanounanue/appcache/application_1435696841856_0027/httpd-c4abf72b-2ee4-45d7-8252-c68f925bef58
15/07/01 16:54:04 INFO spark.HttpServer: Starting HTTP Server
15/07/01 16:54:04 INFO server.Server: jetty-8.y.z-SNAPSHOT
15/07/01 16:54:04 INFO server.AbstractConnector: Started [email protected]:56437
15/07/01 16:54:04 INFO util.Utils: Successfully started service 'HTTP file server' on port 56437.
15/07/01 16:54:04 INFO spark.SparkEnv: Registering OutputCommitCoordinator
15/07/01 16:54:04 INFO ui.JettyUtils: Adding filter: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
15/07/01 16:54:04 INFO server.Server: jetty-8.y.z-SNAPSHOT
15/07/01 16:54:04 INFO server.AbstractConnector: Started [email protected]:37958
15/07/01 16:54:04 INFO util.Utils: Successfully started service 'SparkUI' on port 37958.
15/07/01 16:54:04 INFO ui.SparkUI: Started SparkUI at http://10.0.0.36:37958
15/07/01 16:54:04 INFO cluster.YarnClusterScheduler: Created YarnClusterScheduler
15/07/01 16:54:04 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 49759.
15/07/01 16:54:04 INFO netty.NettyBlockTransferService: Server created on 49759
15/07/01 16:54:05 INFO storage.BlockManagerMaster: Trying to register BlockManager
15/07/01 16:54:05 INFO storage.BlockManagerMasterEndpoint: Registering block manager 10.0.0.36:49759 with 245.7 MB RAM, BlockManagerId(driver, 10.0.0.36, 49759)
15/07/01 16:54:05 INFO storage.BlockManagerMaster: Registered BlockManager
15/07/01 16:54:05 INFO scheduler.EventLoggingListener: Logging events to hdfs://ip-10-0-0-220.ec2.internal:8020/user/spark/applicationHistory/application_1435696841856_0027_1
15/07/01 16:54:05 INFO cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster registered as AkkaRpcEndpointRef(Actor[akka://sparkDriver/user/YarnAM#-1566924249])
15/07/01 16:54:05 INFO client.RMProxy: Connecting to ResourceManager at ip-10-0-0-220.ec2.internal/10.0.0.220:8030
15/07/01 16:54:05 INFO yarn.YarnRMClient: Registering the ApplicationMaster
15/07/01 16:54:05 INFO yarn.YarnAllocator: Will request 2 executor containers, each with 1 cores and 1408 MB memory including 384 MB overhead
15/07/01 16:54:05 INFO yarn.YarnAllocator: Container request (host: Any, capability: <memory:1408, vCores:1>)
15/07/01 16:54:05 INFO yarn.YarnAllocator: Container request (host: Any, capability: <memory:1408, vCores:1>)
15/07/01 16:54:05 INFO yarn.ApplicationMaster: Started progress reporter thread - sleep time : 5000
15/07/01 16:54:11 INFO impl.AMRMClientImpl: Received new token for : ip-10-0-0-99.ec2.internal:8041
15/07/01 16:54:11 INFO impl.AMRMClientImpl: Received new token for : ip-10-0-0-37.ec2.internal:8041
15/07/01 16:54:11 INFO yarn.YarnAllocator: Launching container container_1435696841856_0027_01_000002 for on host ip-10-0-0-99.ec2.internal
15/07/01 16:54:11 INFO yarn.YarnAllocator: Launching ExecutorRunnable. driverUrl: akka.tcp://[email protected]:41190/user/CoarseGrainedScheduler,  executorHostname: ip-10-0-0-99.ec2.internal
15/07/01 16:54:11 INFO yarn.YarnAllocator: Launching container container_1435696841856_0027_01_000003 for on host ip-10-0-0-37.ec2.internal
15/07/01 16:54:11 INFO yarn.ExecutorRunnable: Starting Executor Container
15/07/01 16:54:11 INFO yarn.YarnAllocator: Launching ExecutorRunnable. driverUrl: akka.tcp://[email protected]:41190/user/CoarseGrainedScheduler,  executorHostname: ip-10-0-0-37.ec2.internal
15/07/01 16:54:11 INFO yarn.YarnAllocator: Received 2 containers from YARN, launching executors on 2 of them.
15/07/01 16:54:11 INFO impl.ContainerManagementProtocolProxy: yarn.client.max-cached-nodemanagers-proxies : 0
15/07/01 16:54:11 INFO yarn.ExecutorRunnable: Starting Executor Container
15/07/01 16:54:11 INFO yarn.ExecutorRunnable: Setting up ContainerLaunchContext
15/07/01 16:54:11 INFO impl.ContainerManagementProtocolProxy: yarn.client.max-cached-nodemanagers-proxies : 0
15/07/01 16:54:11 INFO yarn.ExecutorRunnable: Setting up ContainerLaunchContext
15/07/01 16:54:11 INFO yarn.ExecutorRunnable: Preparing Local resources
15/07/01 16:54:11 INFO yarn.ExecutorRunnable: Preparing Local resources
15/07/01 16:54:11 INFO yarn.ExecutorRunnable: Prepared Local resources Map(__spark__.jar -> resource { scheme: "hdfs" host: "ip-10-0-0-220.ec2.internal" port: 8020 file: "/user/nanounanue/.sparkStaging/application_1435696841856_0027/spark-assembly-1.4.0-hadoop2.6.0.jar" } s
ize: 162896305 timestamp: 1435784032445 type: FILE visibility: PRIVATE, pyspark.zip -> resource { scheme: "hdfs" host: "ip-10-0-0-220.ec2.internal" port: 8020 file: "/user/nanounanue/.sparkStaging/application_1435696841856_0027/pyspark.zip" } size: 281333 timestamp: 1435784
032613 type: FILE visibility: PRIVATE, py4j-0.8.2.1-src.zip -> resource { scheme: "hdfs" host: "ip-10-0-0-220.ec2.internal" port: 8020 file: "/user/nanounanue/.sparkStaging/application_1435696841856_0027/py4j-0.8.2.1-src.zip" } size: 37562 timestamp: 1435784032652 type: FIL
E visibility: PRIVATE, minimal-example2.py -> resource { scheme: "hdfs" host: "ip-10-0-0-220.ec2.internal" port: 8020 file: "/user/nanounanue/.sparkStaging/application_1435696841856_0027/minimal-example2.py" } size: 2448 timestamp: 1435784032692 type: FILE visibility: PRIVA
TE)
15/07/01 16:54:11 INFO yarn.ExecutorRunnable: Prepared Local resources Map(__spark__.jar -> resource { scheme: "hdfs" host: "ip-10-0-0-220.ec2.internal" port: 8020 file: "/user/nanounanue/.sparkStaging/application_1435696841856_0027/spark-assembly-1.4.0-hadoop2.6.0.jar" } s
ize: 162896305 timestamp: 1435784032445 type: FILE visibility: PRIVATE, pyspark.zip -> resource { scheme: "hdfs" host: "ip-10-0-0-220.ec2.internal" port: 8020 file: "/user/nanounanue/.sparkStaging/application_1435696841856_0027/pyspark.zip" } size: 281333 timestamp: 1435784
032613 type: FILE visibility: PRIVATE, py4j-0.8.2.1-src.zip -> resource { scheme: "hdfs" host: "ip-10-0-0-220.ec2.internal" port: 8020 file: "/user/nanounanue/.sparkStaging/application_1435696841856_0027/py4j-0.8.2.1-src.zip" } size: 37562 timestamp: 1435784032652 type: FIL
E visibility: PRIVATE, minimal-example2.py -> resource { scheme: "hdfs" host: "ip-10-0-0-220.ec2.internal" port: 8020 file: "/user/nanounanue/.sparkStaging/application_1435696841856_0027/minimal-example2.py" } size: 2448 timestamp: 1435784032692 type: FILE visibility: PRIVA
TE)
15/07/01 16:54:11 INFO yarn.ExecutorRunnable: Setting up executor with environment: Map(CLASSPATH -> {{PWD}}<CPS>{{PWD}}/__spark__.jar<CPS>$HADOOP_CLIENT_CONF_DIR<CPS>$HADOOP_CONF_DIR<CPS>$HADOOP_COMMON_HOME/*<CPS>$HADOOP_COMMON_HOME/lib/*<CPS>$HADOOP_HDFS_HOME/*<CPS>$HADOO
P_HDFS_HOME/lib/*<CPS>$HADOOP_YARN_HOME/*<CPS>$HADOOP_YARN_HOME/lib/*<CPS>$HADOOP_MAPRED_HOME/*<CPS>$HADOOP_MAPRED_HOME/lib/*<CPS>$MR2_CLASSPATH, SPARK_LOG_URL_STDERR -> http://ip-10-0-0-37.ec2.internal:8042/node/containerlogs/container_1435696841856_0027_01_000003/nanounan
ue/stderr?start=0, SPARK_YARN_STAGING_DIR -> .sparkStaging/application_1435696841856_0027, SPARK_YARN_CACHE_FILES_FILE_SIZES -> 162896305,281333,37562,2448, SPARK_USER -> nanounanue, SPARK_YARN_CACHE_FILES_VISIBILITIES -> PRIVATE,PRIVATE,PRIVATE,PRIVATE, SPARK_YARN_MODE -> 
true, SPARK_YARN_CACHE_FILES_TIME_STAMPS -> 1435784032445,1435784032613,1435784032652,1435784032692, PYTHONPATH -> pyspark.zip:py4j-0.8.2.1-src.zip, SPARK_LOG_URL_STDOUT -> http://ip-10-0-0-37.ec2.internal:8042/node/containerlogs/container_1435696841856_0027_01_000003/nanou
nanue/stdout?start=0, SPARK_YARN_CACHE_FILES -> hdfs://ip-10-0-0-220.ec2.internal:8020/user/nanounanue/.sparkStaging/application_1435696841856_0027/spark-assembly-1.4.0-hadoop2.6.0.jar#__spark__.jar,hdfs://ip-10-0-0-220.ec2.internal:8020/user/nanounanue/.sparkStaging/applic
ation_1435696841856_0027/pyspark.zip#pyspark.zip,hdfs://ip-10-0-0-220.ec2.internal:8020/user/nanounanue/.sparkStaging/application_1435696841856_0027/py4j-0.8.2.1-src.zip#py4j-0.8.2.1-src.zip,hdfs://ip-10-0-0-220.ec2.internal:8020/user/nanounanue/.sparkStaging/application_14
35696841856_0027/minimal-example2.py#minimal-example2.py)
15/07/01 16:54:11 INFO yarn.ExecutorRunnable: Setting up executor with environment: Map(CLASSPATH -> {{PWD}}<CPS>{{PWD}}/__spark__.jar<CPS>$HADOOP_CLIENT_CONF_DIR<CPS>$HADOOP_CONF_DIR<CPS>$HADOOP_COMMON_HOME/*<CPS>$HADOOP_COMMON_HOME/lib/*<CPS>$HADOOP_HDFS_HOME/*<CPS>$HADOO
P_HDFS_HOME/lib/*<CPS>$HADOOP_YARN_HOME/*<CPS>$HADOOP_YARN_HOME/lib/*<CPS>$HADOOP_MAPRED_HOME/*<CPS>$HADOOP_MAPRED_HOME/lib/*<CPS>$MR2_CLASSPATH, SPARK_LOG_URL_STDERR -> http://ip-10-0-0-99.ec2.internal:8042/node/containerlogs/container_1435696841856_0027_01_000002/nanounan
ue/stderr?start=0, SPARK_YARN_STAGING_DIR -> .sparkStaging/application_1435696841856_0027, SPARK_YARN_CACHE_FILES_FILE_SIZES -> 162896305,281333,37562,2448, SPARK_USER -> nanounanue, SPARK_YARN_CACHE_FILES_VISIBILITIES -> PRIVATE,PRIVATE,PRIVATE,PRIVATE, SPARK_YARN_MODE -> 
true, SPARK_YARN_CACHE_FILES_TIME_STAMPS -> 1435784032445,1435784032613,1435784032652,1435784032692, PYTHONPATH -> pyspark.zip:py4j-0.8.2.1-src.zip, SPARK_LOG_URL_STDOUT -> http://ip-10-0-0-99.ec2.internal:8042/node/containerlogs/container_1435696841856_0027_01_000002/nanou
nanue/stdout?start=0, SPARK_YARN_CACHE_FILES -> hdfs://ip-10-0-0-220.ec2.internal:8020/user/nanounanue/.sparkStaging/application_1435696841856_0027/spark-assembly-1.4.0-hadoop2.6.0.jar#__spark__.jar,hdfs://ip-10-0-0-220.ec2.internal:8020/user/nanounanue/.sparkStaging/applic
ation_1435696841856_0027/pyspark.zip#pyspark.zip,hdfs://ip-10-0-0-220.ec2.internal:8020/user/nanounanue/.sparkStaging/application_1435696841856_0027/py4j-0.8.2.1-src.zip#py4j-0.8.2.1-src.zip,hdfs://ip-10-0-0-220.ec2.internal:8020/user/nanounanue/.sparkStaging/application_14
35696841856_0027/minimal-example2.py#minimal-example2.py)
15/07/01 16:54:11 INFO yarn.ExecutorRunnable: Setting up executor with commands: List(LD_LIBRARY_PATH="/opt/cloudera/parcels/CDH-5.3.3-1.cdh5.3.3.p0.5/lib/hadoop/lib/native:$LD_LIBRARY_PATH", {{JAVA_HOME}}/bin/java, -server, -XX:OnOutOfMemoryError='kill %p', -Xms1024m, -Xmx
1024m, -Djava.io.tmpdir={{PWD}}/tmp, '-Dspark.ui.port=0', '-Dspark.driver.port=41190', -Dspark.yarn.app.container.log.dir=<LOG_DIR>, org.apache.spark.executor.CoarseGrainedExecutorBackend, --driver-url, akka.tcp://[email protected]:41190/user/CoarseGrainedScheduler, --e
xecutor-id, 1, --hostname, ip-10-0-0-99.ec2.internal, --cores, 1, --app-id, application_1435696841856_0027, --user-class-path, file:$PWD/__app__.jar, 1>, <LOG_DIR>/stdout, 2>, <LOG_DIR>/stderr)
15/07/01 16:54:11 INFO yarn.ExecutorRunnable: Setting up executor with commands: List(LD_LIBRARY_PATH="/opt/cloudera/parcels/CDH-5.3.3-1.cdh5.3.3.p0.5/lib/hadoop/lib/native:$LD_LIBRARY_PATH", {{JAVA_HOME}}/bin/java, -server, -XX:OnOutOfMemoryError='kill %p', -Xms1024m, -Xmx
1024m, -Djava.io.tmpdir={{PWD}}/tmp, '-Dspark.ui.port=0', '-Dspark.driver.port=41190', -Dspark.yarn.app.container.log.dir=<LOG_DIR>, org.apache.spark.executor.CoarseGrainedExecutorBackend, --driver-url, akka.tcp://[email protected]:41190/user/CoarseGrainedScheduler, --e
xecutor-id, 2, --hostname, ip-10-0-0-37.ec2.internal, --cores, 1, --app-id, application_1435696841856_0027, --user-class-path, file:$PWD/__app__.jar, 1>, <LOG_DIR>/stdout, 2>, <LOG_DIR>/stderr)
15/07/01 16:54:11 INFO impl.ContainerManagementProtocolProxy: Opening proxy : ip-10-0-0-37.ec2.internal:8041
15/07/01 16:54:14 INFO yarn.ApplicationMaster$AMEndpoint: Driver terminated or disconnected! Shutting down. ip-10-0-0-99.ec2.internal:43176
15/07/01 16:54:15 INFO yarn.ApplicationMaster$AMEndpoint: Driver terminated or disconnected! Shutting down. ip-10-0-0-37.ec2.internal:58472
15/07/01 16:54:15 INFO cluster.YarnClusterSchedulerBackend: Registered executor: AkkaRpcEndpointRef(Actor[akka.tcp://[email protected]:49047/user/Executor#563862009]) with ID 1
15/07/01 16:54:15 INFO cluster.YarnClusterSchedulerBackend: Registered executor: AkkaRpcEndpointRef(Actor[akka.tcp://[email protected]:36122/user/Executor#1370723906]) with ID 2
15/07/01 16:54:15 INFO cluster.YarnClusterSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
15/07/01 16:54:15 INFO cluster.YarnClusterScheduler: YarnClusterScheduler.postStartHook done
15/07/01 16:54:15 INFO storage.BlockManagerMasterEndpoint: Registering block manager ip-10-0-0-99.ec2.internal:59769 with 530.3 MB RAM, BlockManagerId(1, ip-10-0-0-99.ec2.internal, 59769)
15/07/01 16:54:16 INFO storage.BlockManagerMasterEndpoint: Registering block manager ip-10-0-0-37.ec2.internal:48859 with 530.3 MB RAM, BlockManagerId(2, ip-10-0-0-37.ec2.internal, 48859)
15/07/01 16:54:16 INFO hive.HiveContext: Initializing execution hive, version 0.13.1
15/07/01 16:54:17 INFO metastore.HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
15/07/01 16:54:17 INFO metastore.ObjectStore: ObjectStore, initialize called
15/07/01 16:54:17 INFO spark.SparkContext: Invoking stop() from shutdown hook
15/07/01 16:54:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/metrics/json,null}
15/07/01 16:54:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/kill,null}
15/07/01 16:54:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/api,null}
15/07/01 16:54:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/,null}
15/07/01 16:54:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/static,null}
15/07/01 16:54:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump/json,null}
15/07/01 16:54:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump,null}
15/07/01 16:54:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/json,null}
15/07/01 16:54:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors,null}
15/07/01 16:54:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment/json,null}
15/07/01 16:54:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment,null}
15/07/01 16:54:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd/json,null}
15/07/01 16:54:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd,null}
15/07/01 16:54:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/json,null}
15/07/01 16:54:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage,null}
15/07/01 16:54:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool/json,null}
15/07/01 16:54:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool,null}
15/07/01 16:54:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/json,null}
15/07/01 16:54:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage,null}
15/07/01 16:54:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/json,null}
15/07/01 16:54:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages,null}
15/07/01 16:54:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job/json,null}
15/07/01 16:54:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job,null}
15/07/01 16:54:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/json,null}
15/07/01 16:54:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs,null}
15/07/01 16:54:17 INFO ui.SparkUI: Stopped Spark web UI at http://10.0.0.36:37958
15/07/01 16:54:17 INFO scheduler.DAGScheduler: Stopping DAGScheduler
15/07/01 16:54:17 INFO cluster.YarnClusterSchedulerBackend: Shutting down all executors
15/07/01 16:54:17 INFO cluster.YarnClusterSchedulerBackend: Asking each executor to shut down
15/07/01 16:54:17 INFO yarn.ApplicationMaster$AMEndpoint: Driver terminated or disconnected! Shutting down. ip-10-0-0-99.ec2.internal:49047
15/07/01 16:54:17 INFO yarn.ApplicationMaster$AMEndpoint: Driver terminated or disconnected! Shutting down. ip-10-0-0-37.ec2.internal:36122
15/07/01 16:54:17 INFO ui.SparkUI: Stopped Spark web UI at http://10.0.0.36:37958
15/07/01 16:54:17 INFO scheduler.DAGScheduler: Stopping DAGScheduler
15/07/01 16:54:17 INFO cluster.YarnClusterSchedulerBackend: Shutting down all executors
15/07/01 16:54:17 INFO cluster.YarnClusterSchedulerBackend: Asking each executor to shut down
15/07/01 16:54:17 INFO yarn.ApplicationMaster$AMEndpoint: Driver terminated or disconnected! Shutting down. ip-10-0-0-99.ec2.internal:49047
15/07/01 16:54:17 INFO yarn.ApplicationMaster$AMEndpoint: Driver terminated or disconnected! Shutting down. ip-10-0-0-37.ec2.internal:36122
15/07/01 16:54:17 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
15/07/01 16:54:17 INFO storage.MemoryStore: MemoryStore cleared
15/07/01 16:54:17 INFO storage.BlockManager: BlockManager stopped
15/07/01 16:54:17 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
15/07/01 16:54:17 INFO spark.SparkContext: Successfully stopped SparkContext
15/07/01 16:54:17 INFO scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
15/07/01 16:54:17 INFO remote.RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
15/07/01 16:54:17 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
15/07/01 16:54:17 INFO yarn.ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0, (reason: Shutdown hook called before final status was reported.)
15/07/01 16:54:17 INFO yarn.ApplicationMaster: Unregistering ApplicationMaster with SUCCEEDED (diag message: Shutdown hook called before final status was reported.)
15/07/01 16:54:17 INFO impl.AMRMClientImpl: Waiting for application to be successfully unregistered.
15/07/01 16:54:17 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remoting shut down.
15/07/01 16:54:17 INFO yarn.ApplicationMaster: Deleting staging directory .sparkStaging/application_1435696841856_0027
15/07/01 16:54:17 INFO util.Utils: Shutdown hook called
15/07/01 16:54:17 INFO util.Utils: Deleting directory /yarn/nm/usercache/nanounanue/appcache/application_1435696841856_0027/pyspark-215f5c19-b1cb-47df-ad43-79da4244de61
15/07/01 16:54:17 INFO util.Utils: Deleting directory /yarn/nm/usercache/nanounanue/appcache/application_1435696841856_0027/container_1435696841856_0027_01_000001/tmp/spark-c96dc9dc-e6ee-451b-b09e-637f5d4ca990

LogType: stdout
LogLength: 2404
Log Contents:
[(u'spark.eventLog.enabled', u'true'), (u'spark.submit.pyArchives', u'pyspark.zip:py4j-0.8.2.1-src.zip'), (u'spark.yarn.app.container.log.dir', u'/var/log/hadoop-yarn/container/application_1435696841856_0027/container_1435696841856_0027_01_000001'), (u'spark.eventLog.dir', 
u'hdfs://ip-10-0-0-220.ec2.internal:8020/user/spark/applicationHistory'), (u'spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_HOSTS', u'ip-10-0-0-220.ec2.internal'), (u'spark.yarn.historyServer.address', u'http://ip-10-0-0-220.ec2.internal:18088'
), (u'spark.ui.port', u'0'), (u'spark.yarn.app.id', u'application_1435696841856_0027'), (u'spark.app.name', u'minimal-example2.py'), (u'spark.executor.instances', u'2'), (u'spark.executorEnv.PYTHONPATH', u'pyspark.zip:py4j-0.8.2.1-src.zip'), (u'spark.submit.pyFiles', u''), 
(u'spark.executor.extraLibraryPath', u'/opt/cloudera/parcels/CDH-5.3.3-1.cdh5.3.3.p0.5/lib/hadoop/lib/native'), (u'spark.master', u'yarn-cluster'), (u'spark.ui.filters', u'org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter'), (u'spark.org.apache.hadoop.yarn.server.w
ebproxy.amfilter.AmIpFilter.param.PROXY_URI_BASES', u'http://ip-10-0-0-220.ec2.internal:8088/proxy/application_1435696841856_0027'), (u'spark.driver.extraLibraryPath', u'/opt/cloudera/parcels/CDH-5.3.3-1.cdh5.3.3.p0.5/lib/hadoop/lib/native'), (u'spark.yarn.app.attemptId', u
'1')]
<pyspark.context.SparkContext object at 0x3fd53d0>
1.4.0
<pyspark.sql.context.HiveContext object at 0x40a9110>
Traceback (most recent call last):
  File "minimal-example2.py", line 53, in <module>
    access = sqlContext.read.json("hdfs://10.0.0.220/raw/logs/arquimedes/access/*.json")
  File "/yarn/nm/usercache/nanounanue/appcache/application_1435696841856_0027/container_1435696841856_0027_01_000001/pyspark.zip/pyspark/sql/context.py", line 591, in read
  File "/yarn/nm/usercache/nanounanue/appcache/application_1435696841856_0027/container_1435696841856_0027_01_000001/pyspark.zip/pyspark/sql/readwriter.py", line 39, in __init__
  File "/yarn/nm/usercache/nanounanue/appcache/application_1435696841856_0027/container_1435696841856_0027_01_000001/pyspark.zip/pyspark/sql/context.py", line 619, in _ssql_ctx
Exception: ("You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly", Py4JJavaError(u'An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.\n', JavaObject id=o53))

The important part is the last line: "You must build Spark with Hive." Why? What am I doing wrong?

asked Jul 01 '15 by nanounanue

2 Answers

I recently ran into this same issue, but it turned out that the message from Spark was misleading; there were no missing jars. The problem for me was that the Java HiveContext class, which PySpark calls into, parses hive-site.xml when it is constructed, and an exception was being raised during that construction. (PySpark catches this exception and incorrectly suggests that it is due to a missing jar.)

In my case the error came from the property hive.metastore.client.connect.retry.delay, which was set to 2s. The HiveContext class tries to parse this value as an integer, which fails. The fix was to change it to 2, and likewise to strip the trailing 's' from hive.metastore.client.socket.timeout and hive.metastore.client.socket.lifetime.

Note that you can get a more descriptive error by calling sqlContext._get_hive_ctx() directly.
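For example, here is a minimal sketch of that debugging step (the app name is made up; the point is that the exception raised inside _get_hive_ctx() carries the real hive-site.xml parsing error instead of the generic "build Spark with Hive" message):

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="hive-context-debug")
sqlContext = HiveContext(sc)

try:
    # Constructing the underlying Java HiveContext is what parses
    # hive-site.xml; any bad property value surfaces here.
    sqlContext._get_hive_ctx()
except Exception as e:
    # Prints the wrapped Py4JJavaError instead of the misleading
    # "You must build Spark with Hive" message.
    print(e)

sc.stop()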

answered by santon


You should create an SQLContext instead of a HiveContext:

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)
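For instance, a minimal sketch of how the submitted script might look with this change, assuming the JSON read itself does not need the Hive Metastore (printSchema is only there to produce some output):

# -*- coding: utf-8 -*-
from __future__ import print_function

from pyspark import SparkContext
from pyspark.sql import SQLContext

if __name__ == "__main__":
    sc = SparkContext(appName="Minimal Example 2")

    # SQLContext does not construct a Hive client, so it avoids the
    # HiveContext construction error seen in the YARN logs.
    sqlContext = SQLContext(sc)

    access = sqlContext.read.json("hdfs://10.0.0.220/raw/logs/arquimedes/access/*.json")
    access.printSchema()

    sc.stop()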
answered by Marl