
How to force STORE (overwrite) to HDFS in Pig?

When developing Pig scripts that use the STORE command, I have to delete the output directory before every run or the script stops and reports:

2012-06-19 19:22:49,680 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 6000: Output Location Validation Failed for: 'hdfs://[server]/user/[user]/foo/bar More info to follow:
Output directory hdfs://[server]/user/[user]/foo/bar already exists

So I'm searching for an in-Pig solution that removes the directory automatically, one that also doesn't choke if the directory doesn't exist at call time.
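Right now my workaround is to clear the path by hand before every run (the script name below is just an example):

# delete the previous run's output, then launch the script again
hadoop fs -rmr foo/bar
pig myscript.pig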

In the Pig Latin Reference I found the shell command invoker fs. Unfortunately, a Pig script breaks whenever anything it invokes produces an error, so I can't use

fs -rmr foo/bar

(i.e. remove recursively), since it breaks if the directory doesn't exist. For a moment I thought I might use

fs -test -e foo/bar

which is a test and shouldn't break, or so I thought. However, Pig again interprets test's return code on a non-existing directory as a failure code and breaks.

There is a JIRA ticket for the Pig project addressing my problem that suggests an optional parameter OVERWRITE or FORCE_WRITE for the STORE command. However, I'm on Pig 0.8.1 out of necessity, and no such parameter exists there.

asked Jun 19 '12 by valid

People also ask

How can we run the HDFS commands in Pig?

First, load the data file using PigStorage:

product = LOAD 'hdfs://localhost:9000/product_dir/products.csv'
    USING PigStorage(',')
    AS (product_id:int, product_name:chararray, price:int);
dump product;

Then you can execute the script file, which is stored in HDFS.
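More generally, any hadoop fs subcommand can be run from a Pig script or the Grunt shell by prefixing it with fs (the paths below are only illustrative):

-- any hadoop fs subcommand works when prefixed with fs
fs -ls /user/[user]
fs -mkdir /user/[user]/newdir
fs -cat /user/[user]/products.csv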

Is Pig still used in Hadoop?

Pig is generally used with Hadoop; we can perform all the data manipulation operations in Hadoop using Apache Pig. To write data analysis programs, Pig provides a high-level language known as Pig Latin.

What is the default mode of Pig execution?

MapReduce mode is the default; you can, but don't need to, specify it with the -x flag (pig or pig -x mapreduce). Tez mode - to run Pig in Tez mode, you need access to a Hadoop cluster and an HDFS installation.
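For example (local mode added here for completeness; Tez mode requires Pig 0.14 or later):

pig                  # MapReduce mode, the default
pig -x mapreduce     # the same, stated explicitly
pig -x tez           # Tez mode
pig -x local         # local mode, no cluster needed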


2 Answers

At last I found a solution on grokbase. Since finding it took too long, I will reproduce it here and add to it.

Suppose you want to store your output using the statement

STORE Relation INTO 'foo/bar';

Then, to delete the directory, call this at the start of the script:

rmf foo/bar

No ";" or quotations required since it is a shell command.

I cannot reproduce it now, but at some point I got an error message (something about missing files) where I can only assume that rmf interfered with map/reduce. So I recommend putting the call before any relation declaration; after SETs, REGISTERs and %default statements should be fine.

Example:

SET mapred.fairscheduler.pool 'inhouse';
REGISTER /usr/lib/pig/contrib/piggybank/java/piggybank.jar;
%default name 'foobar'
rmf foo/bar
Rel = LOAD 'something.tsv';
STORE Rel INTO 'foo/bar';
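A variant of the same script, using a made-up parameter name $out_dir, keeps rmf and STORE pointing at the same path; parameter substitution in Pig is textual, so it should rewrite the rmf line as well:

-- $out_dir is a hypothetical parameter name; substitution is
-- textual, so it also applies to the rmf line
%default out_dir 'foo/bar'
rmf $out_dir
Rel = LOAD 'something.tsv';
STORE Rel INTO '$out_dir';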
answered Sep 26 '22 by valid


Once you use the fs command, there are a lot of ways to do this. For an individual file, I wound up adding this to the beginning of my scripts:

-- Delete file (won't work for output, which will be a directory,
-- but will work for a file that gets copied or moved during the
-- script). fs -touchz creates top_100 if it is missing, so rm
-- has something to remove.
fs -touchz top_100
rm top_100

For a directory:

-- Delete dir; the -f flag keeps the command from failing
-- when the directory doesn't exist
fs -rm -r -f out
answered Sep 23 '22 by Todd Nemet