Oozie shell script action

Tags: bash, hadoop, hive, oozie

I am exploring the capabilities of Oozie for managing Hadoop workflows. I am trying to set up a shell action which invokes some hive commands. My shell script hive.sh looks like:

#!/bin/bash
hive -f hivescript

Where the hive script (which has been tested independently) creates some tables and so on. My question is where to keep the hivescript and then how to reference it from the shell script.

I've tried two ways, first using a local path, like hive -f /local/path/to/file, and using a relative path like above, hive -f hivescript, in which case I keep my hivescript in the oozie app path directory (same as hive.sh and workflow.xml) and set it to go to the distributed cache via the workflow.xml.

With both methods I get the error message: "Main class [org.apache.oozie.action.hadoop.ShellMain], exit code [1]" on the oozie web console. Additionally I've tried using hdfs paths in shell scripts and this does not work as far as I know.

My job.properties file:

nameNode=hdfs://sandbox:8020
jobTracker=hdfs://sandbox:50300   
queueName=default
oozie.libpath=${nameNode}/user/oozie/share/lib
oozie.use.system.libpath=true
oozieProjectRoot=${nameNode}/user/sandbox/poc1
appPath=${oozieProjectRoot}/testwf
oozie.wf.application.path=${appPath}

And workflow.xml:

<shell xmlns="uri:oozie:shell-action:0.1">
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <configuration>
        <property>
            <name>mapred.job.queue.name</name>
            <value>${queueName}</value>
        </property>
    </configuration>
    <exec>${appPath}/hive.sh</exec>
    <file>${appPath}/hive.sh</file>
    <file>${appPath}/hive_pill</file>
</shell>
<ok to="end"/>
<error to="end"/>
</action>
<end name="end"/>

My objective is to use oozie to call a hive script through a shell script, please give your suggestions.

asked Mar 13 '14 21:03

thedragonwarrior

1 Answer

One thing that has always been tricky about Oozie workflows is running bash scripts. Hadoop is built to be massively parallel, so a script does not necessarily run on the machine you submitted the job from.

When an Oozie workflow executes a shell action, the action is given resources by your job tracker or YARN on any node in the cluster. This means that a local path to your file will not work, since that file exists only on the local disk of your edge node. If the action happened to land on the edge node it would work, but any other time it would fail, and you do not control which node is chosen.
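When a shell action fails with nothing more than exit code [1], it can help to have the script report where it is actually running. A minimal sketch, assuming a hypothetical helper named describe_action_env that you would paste at the top of hive.sh and call before anything else; as far as I know the output lands in the stdout log of the launcher task:

```shell
#!/bin/bash
# Hypothetical diagnostic helper (not part of Oozie): prints where the
# shell action is executing and which files were shipped with it.
describe_action_env() {
    echo "host: $(uname -n)"   # usually a worker node, not the edge node
    echo "user: $(id -un)"     # the OS user the action runs as
    echo "cwd:  $(pwd)"        # the action's temporary working directory
    ls -l                      # files shipped via <file> should be listed here
}
```

If hivescript is missing from that listing, the problem is in the workflow's file elements rather than in Hive itself.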

To get around this, I found it best to keep the files I needed (including the sh scripts) in HDFS, either in a shared lib directory or in the same location as my workflow.
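Getting the files into HDFS could look roughly like the sketch below. It assumes a Hadoop 2.x style client (hdfs dfs with -mkdir -p and -put -f; older installs may need hadoop fs and no -f), that the three files sit in your current local directory, and a hypothetical function name and HDFS_CMD override that are only there for illustration:

```shell
#!/bin/bash
# Hypothetical helper: copy the workflow files into HDFS.
#   $1 = workflow application directory, e.g. /user/directory
#   $2 = shared lib directory,           e.g. /user/lib
# HDFS_CMD is an assumed override so the function can be dry-run.
stage_workflow_files() {
    local app_dir="$1" lib_dir="$2"
    local hdfs_cmd="${HDFS_CMD:-hdfs}"
    # Create both target directories, then upload (overwriting old copies).
    "$hdfs_cmd" dfs -mkdir -p "$app_dir" "$lib_dir" &&
    "$hdfs_cmd" dfs -put -f workflow.xml ETL_file1.hql "$app_dir/" &&
    "$hdfs_cmd" dfs -put -f hive.sh "$lib_dir/"
}

# Example call:
#   stage_workflow_files /user/directory /user/lib
```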

Here is a good way to approach what you are trying to achieve.

<shell xmlns="uri:oozie:shell-action:0.1">

    <exec>hive.sh</exec> 
    <file>/user/lib/hive.sh#hive.sh</file>
    <file>ETL_file1.hql#hivescript</file>

</shell>

One thing you will notice is that the exec is just hive.sh, since we are assuming that the file will be placed in the base directory where the shell action is executed.

To make sure that assumption holds, you must list the file's HDFS path in a file element; this forces Oozie to distribute that file with the action. In your case, the Hive script launcher should only be written once and simply be fed different files. Since this is a one-to-many relationship, hive.sh should be kept in a lib directory and not copied into every workflow.

Then there is the line:

<file>ETL_file1.hql#hivescript</file>

This line does two things. Before the # we have the location of the file. Here it is just the file name, since each workflow's own Hive files should be kept next to its workflow.xml:

user/directory/workflow.xml
user/directory/ETL_file1.hql

Oozie then distributes the file automatically to whichever node runs the sh script. The part after the # is the name the file is given (as a symlink) in the working directory of the shell action, so inside the sh script you refer to it simply as hivescript. This gives you the ability to reuse the same script over and over and simply feed it different files.
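With that in place, hive.sh only ever needs to know the name after the #. A minimal sketch of such a script, assuming the hive client is on the PATH of the worker nodes; the function names and the HIVE_CMD override are hypothetical and only there to make the script easier to try out:

```shell
#!/bin/bash
# Sketch of a reusable hive.sh. "hivescript" is the name given after the #
# in the workflow's <file> element, not a real file name in HDFS.

# Fail with a readable message if the file was not shipped with the action.
require_localized() {
    local name="$1"
    if [ ! -e "$name" ]; then
        echo "missing file in working directory: $name ($(pwd))" >&2
        return 1
    fi
}

# Run the query file. HIVE_CMD is an assumed override, defaulting to hive.
run_hive_script() {
    local script="${1:-hivescript}"
    require_localized "$script" || return 1
    "${HIVE_CMD:-hive}" -f "$script"
}

# In the real script the last line would simply be:
#   run_hive_script hivescript
```

The existence check turns a bare exit code [1] into a message that says which file was missing and where the script was looking.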

A note on HDFS paths: if the file is nested inside the same directory as the workflow, then you only need to specify the path relative to that directory:

user/directory/workflow.xml
user/directory/hive/ETL_file1.hql

would yield:

<file>hive/ETL_file1.hql#hivescript</file>

But if the path is outside of the workflow directory you will need the full path:

user/directory/workflow.xml
user/lib/hive.sh

would yield:

<file>/user/lib/hive.sh#hive.sh</file>
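The two path rules can be written down as a pair of small functions. This is only an illustration of the rules as described here, not Oozie's own code; both function names are hypothetical, and the fallback to the base name when no # is given is my understanding of the default:

```shell
#!/bin/bash
# Hypothetical helpers, for illustration only.

# HDFS path that a <file> entry points at.
#   $1 = workflow application directory, $2 = text of the <file> element
file_entry_source() {
    local app_dir="$1" entry="$2"
    local path="${entry%%'#'*}"        # drop the "#name" suffix, if any
    case "$path" in
        /*|*://*) echo "$path" ;;      # absolute path or full URI: used as given
        *) echo "${app_dir%/}/$path" ;;  # relative: under the workflow directory
    esac
}

# Name the file gets in the action's working directory.
file_entry_link_name() {
    local entry="$1"
    case "$entry" in
        *'#'*) echo "${entry#*'#'}" ;;   # explicit name after the #
        *) echo "${entry##*/}" ;;        # assumed default: the base name
    esac
}
```

For example, file_entry_source /user/directory 'hive/ETL_file1.hql#hivescript' prints /user/directory/hive/ETL_file1.hql, while file_entry_link_name on the same entry prints hivescript.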

I hope this helps everyone.

answered Sep 22 '22 17:09

Ryan Bedard
