Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Oozie job stuck at START action in PREP state

Tags:

hadoop

oozie

I have an Oozie job that I start from java client which gets stuck at START action and it says that it is RUNNING, but the START node is in PREP state.

Why is that and how to fix the problem?

The Oozie workflow contains only one java action. The Hadoop version on the cluster is 2.4.0, and Oozie on cluster is 4.0.0.

Here is the workflow.xml

<workflow-app xmlns='uri:oozie:workflow:0.2' name='java-filecopy-wf'>
<start to='java1'/>
    <action name='java1'>
    <java>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <configuration>
            <property>
                <name>mapred.job.queue.name</name>
                <value>default</value>
            </property>
        </configuration>           
        <main-class>testingoozieclient.Client</main-class>
        <capture-output/>
    </java>
    <ok to="end" />
    <error to="fail" />
</action>
<kill name="fail">
    <message>Java failed, error message[${wf:errorMessage(wf:lastErrorNode())}]
        </message>
</kill>
<end name='end' />

Here is the java client

    OozieClient oozieClient = new OozieClient(args[0]);

    Properties conf = oozieClient.createConfiguration();
    conf.setProperty(OozieClient.APP_PATH, args[1]);

    conf.setProperty("nameNode", args[2]);
    conf.setProperty("jobTracker", args[3]);

    String jobId = null;

    try{
        jobId = oozieClient.run(conf);
    }
    catch(OozieClientException ex){
        Logger.getLogger(Client.class.getName()).log(Level.SEVERE, null, ex);

    }

Since I tried this a couple of times, there are now 5,6 workflows that all have RUNNING as state, but when I look it through the web interface I can see that all of them are stuck on START node in PREP state?


After some of submitted workflows got killed I was able to start another workflow. This time the workflow goes from start to java action, but gets stuck in the java action in a similar way - it stays in the PREP state.

Here's what the log looks like

2015-06-22 17:54:37,366  INFO ActionStartXCommand:539 - USER[hadoop] GROUP[-] TOKEN[] APP[java-filecopy-wf] JOB[0000030-150619153616589-oozie-oozi-W] ACTION[0000030-150619153616589-oozie-oozi-W@:start:] Start action [0000030-150619153616589-oozie-oozi-W@:start:] with user-retry state : userRetryCount [0], userRetryMax [0], userRetryInterval [10]
2015-06-22 17:54:37,367  WARN ActionStartXCommand:542 - USER[hadoop] GROUP[-] TOKEN[] APP[java-filecopy-wf] JOB[0000030-150619153616589-oozie-oozi-W] ACTION[0000030-150619153616589-oozie-oozi-W@:start:] [***0000030-150619153616589-oozie-oozi-W@:start:***]Action status=DONE
2015-06-22 17:54:37,367  WARN ActionStartXCommand:542 - USER[hadoop] GROUP[-] TOKEN[] APP[java-filecopy-wf] JOB[0000030-150619153616589-oozie-oozi-W] ACTION[0000030-150619153616589-oozie-oozi-W@:start:] [***0000030-150619153616589-oozie-oozi-W@:start:***]Action updated in DB!
2015-06-22 17:54:37,426  INFO ActionEndXCommand:539 - USER[hadoop] GROUP[-] TOKEN[] APP[java-filecopy-wf] JOB[0000030-150619153616589-oozie-oozi-W] ACTION[0000030-150619153616589-oozie-oozi-W@:start:] end executor for wf action 0000030-150619153616589-oozie-oozi-W with wf job 0000030-150619153616589-oozie-oozi-W
2015-06-22 17:54:37,676  INFO ActionStartXCommand:539 - USER[hadoop] GROUP[-] TOKEN[] APP[java-filecopy-wf] JOB[0000030-150619153616589-oozie-oozi-W] ACTION[0000030-150619153616589-oozie-oozi-W@java1] Start action [0000030-150619153616589-oozie-oozi-W@java1] with user-retry state : userRetryCount [0], userRetryMax [0], userRetryInterval [10]
2015-06-22 17:54:38,316  INFO JavaActionExecutor:539 - USER[hadoop] GROUP[-] TOKEN[] APP[java-filecopy-wf] JOB[0000030-150619153616589-oozie-oozi-W] ACTION[0000030-150619153616589-oozie-oozi-W@java1] addShareLib: using FileSystem hdfs://master:8020
2015-06-22 17:54:38,501  WARN JavaActionExecutor:542 - USER[hadoop] GROUP[-] TOKEN[] APP[java-filecopy-wf] JOB[0000030-150619153616589-oozie-oozi-W] ACTION[0000030-150619153616589-oozie-oozi-W@java1] credentials is null for the action
2015-06-22 17:54:38,640  INFO JavaActionExecutor:539 - USER[hadoop] GROUP[-] TOKEN[] APP[java-filecopy-wf] JOB[0000030-150619153616589-oozie-oozi-W] ACTION[0000030-150619153616589-oozie-oozi-W@java1] addShareLib: using FileSystem hdfs://master:8020

enter image description here


This morning I found that the job is in state SUSPENDED, start node is OK, but the Java node is in START-RETRY, with the following error - JA006: Call From master02.novalocal/192.168.111.52 to master02.novalocal:8032 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused

I should probably emphasize that the Oozie works on the same machine as Resource manager, so it's strange that it tries to start the workflow on the same machine, but says that connection has failed.

Here is the job log from Oozie:

    2015-06-22 17:54:37,366  INFO ActionStartXCommand:539 - USER[hadoop] GROUP[-] TOKEN[] APP[java-filecopy-wf] JOB[0000030-150619153616589-oozie-oozi-W] ACTION[0000030-150619153616589-oozie-oozi-W@:start:] Start action [0000030-150619153616589-oozie-oozi-W@:start:] with user-retry state : userRetryCount [0], userRetryMax [0], userRetryInterval [10]
2015-06-22 17:54:37,367  WARN ActionStartXCommand:542 - USER[hadoop] GROUP[-] TOKEN[] APP[java-filecopy-wf] JOB[0000030-150619153616589-oozie-oozi-W] ACTION[0000030-150619153616589-oozie-oozi-W@:start:] [***0000030-150619153616589-oozie-oozi-W@:start:***]Action status=DONE
2015-06-22 17:54:37,367  WARN ActionStartXCommand:542 - USER[hadoop] GROUP[-] TOKEN[] APP[java-filecopy-wf] JOB[0000030-150619153616589-oozie-oozi-W] ACTION[0000030-150619153616589-oozie-oozi-W@:start:] [***0000030-150619153616589-oozie-oozi-W@:start:***]Action updated in DB!
2015-06-22 17:54:37,426  INFO ActionEndXCommand:539 - USER[hadoop] GROUP[-] TOKEN[] APP[java-filecopy-wf] JOB[0000030-150619153616589-oozie-oozi-W] ACTION[0000030-150619153616589-oozie-oozi-W@:start:] end executor for wf action 0000030-150619153616589-oozie-oozi-W with wf job 0000030-150619153616589-oozie-oozi-W
2015-06-22 17:54:37,676  INFO ActionStartXCommand:539 - USER[hadoop] GROUP[-] TOKEN[] APP[java-filecopy-wf] JOB[0000030-150619153616589-oozie-oozi-W] ACTION[0000030-150619153616589-oozie-oozi-W@java1] Start action [0000030-150619153616589-oozie-oozi-W@java1] with user-retry state : userRetryCount [0], userRetryMax [0], userRetryInterval [10]
2015-06-22 17:54:38,316  INFO JavaActionExecutor:539 - USER[hadoop] GROUP[-] TOKEN[] APP[java-filecopy-wf] JOB[0000030-150619153616589-oozie-oozi-W] ACTION[0000030-150619153616589-oozie-oozi-W@java1] addShareLib: using FileSystem hdfs://master01.novalocal:8020
2015-06-22 17:54:38,501  WARN JavaActionExecutor:542 - USER[hadoop] GROUP[-] TOKEN[] APP[java-filecopy-wf] JOB[0000030-150619153616589-oozie-oozi-W] ACTION[0000030-150619153616589-oozie-oozi-W@java1] credentials is null for the action
2015-06-22 17:54:38,640  INFO JavaActionExecutor:539 - USER[hadoop] GROUP[-] TOKEN[] APP[java-filecopy-wf] JOB[0000030-150619153616589-oozie-oozi-W] ACTION[0000030-150619153616589-oozie-oozi-W@java1] addShareLib: using FileSystem hdfs://master01.novalocal:8020
2015-06-22 20:05:33,340  WARN ActionStartXCommand:542 - USER[hadoop] GROUP[-] TOKEN[] APP[java-filecopy-wf] JOB[0000030-150619153616589-oozie-oozi-W] ACTION[0000030-150619153616589-oozie-oozi-W@java1] Error starting action [java1]. ErrorType [TRANSIENT], ErrorCode [  JA006], Message [  JA006: Call From master02.novalocal/192.168.111.52 to master02.novalocal:8032 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused]
org.apache.oozie.action.ActionExecutorException:   JA006: Call From master02.novalocal/192.168.111.52 to master02.novalocal:8032 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
    at org.apache.oozie.action.ActionExecutor.convertExceptionHelper(ActionExecutor.java:412)
    at org.apache.oozie.action.ActionExecutor.convertException(ActionExecutor.java:392)
    at org.apache.oozie.action.hadoop.JavaActionExecutor.submitLauncher(JavaActionExecutor.java:837)
    at org.apache.oozie.action.hadoop.JavaActionExecutor.start(JavaActionExecutor.java:988)
    at org.apache.oozie.command.wf.ActionStartXCommand.execute(ActionStartXCommand.java:215)
    at org.apache.oozie.command.wf.ActionStartXCommand.execute(ActionStartXCommand.java:60)
    at org.apache.oozie.command.XCommand.call(XCommand.java:280)
    at org.apache.oozie.service.CallableQueueService$CompositeCallable.call(CallableQueueService.java:326)
    at org.apache.oozie.service.CallableQueueService$CompositeCallable.call(CallableQueueService.java:255)
    at org.apache.oozie.service.CallableQueueService$CallableWrapper.run(CallableQueueService.java:175)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
Caused by: java.net.ConnectException: Call From master02.novalocal/192.168.111.52 to master02.novalocal:8032 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
    at sun.reflect.GeneratedConstructorAccessor98.newInstance(Unknown Source)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
    at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730)
    at org.apache.hadoop.ipc.Client.call(Client.java:1414)
    at org.apache.hadoop.ipc.Client.call(Client.java:1363)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
    at com.sun.proxy.$Proxy42.getDelegationToken(Unknown Source)
    at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getDelegationToken(ApplicationClientProtocolPBClientImpl.java:282)
    at sun.reflect.GeneratedMethodAccessor32.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
    at com.sun.proxy.$Proxy43.getDelegationToken(Unknown Source)
    at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getRMDelegationToken(YarnClientImpl.java:452)
    at org.apache.hadoop.mapred.ResourceMgrDelegate.getDelegationToken(ResourceMgrDelegate.java:166)
    at org.apache.hadoop.mapred.YARNRunner.getDelegationToken(YARNRunner.java:220)
    at org.apache.hadoop.mapreduce.Cluster.getDelegationToken(Cluster.java:400)
    at org.apache.hadoop.mapred.JobClient$16.run(JobClient.java:1203)
    at org.apache.hadoop.mapred.JobClient$16.run(JobClient.java:1200)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594)
    at org.apache.hadoop.mapred.JobClient.getDelegationToken(JobClient.java:1199)
    at org.apache.oozie.service.HadoopAccessorService.createJobClient(HadoopAccessorService.java:377)
    at org.apache.oozie.action.hadoop.JavaActionExecutor.createJobClient(JavaActionExecutor.java:1031)
    at org.apache.oozie.action.hadoop.JavaActionExecutor.submitLauncher(JavaActionExecutor.java:786)
    ... 10 more
Caused by: java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:735)
    at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
    at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:604)
    at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:699)
    at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:367)
    at org.apache.hadoop.ipc.Client.getConnection(Client.java:1462)
    at org.apache.hadoop.ipc.Client.call(Client.java:1381)
    ... 33 more
like image 492
Kobe-Wan Kenobi Avatar asked Oct 31 '22 01:10

Kobe-Wan Kenobi


2 Answers

Pls check the ports in job.properties This is usually problem with namenode and jobtracker ports. Make sure your jobtracker port is correct in job.properties file.

like image 107
Sagar Shimpi Avatar answered Nov 15 '22 08:11

Sagar Shimpi


The major reason for an oozie job getting stuck in PREP state (eventually moving to START_MANUAL state), is misconfiguration of Hadoop services ports.

nameNode=hdfs://localhost:9000
jobTracker=10.71.71.15:8032

If you are running YARN, then the default port for jobtracker is same as resource manager's port.

Also, try to fix other port issues like jobhistoryserver's port (as mentioned in oozie Error message).

like image 26
shriyog Avatar answered Nov 15 '22 06:11

shriyog