How to force STORE (overwrite) to HDFS in Pig?

Tags:

apache-pig

hdfs

When developing Pig scripts that use the STORE command, I have to delete the output directory before every run, or the script stops with:

2012-06-19 19:22:49,680 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 6000: Output Location Validation Failed for: 'hdfs://[server]/user/[user]/foo/bar More info to follow:
Output directory hdfs://[server]/user/[user]/foo/bar already exists

So I'm looking for an in-Pig way to remove the directory automatically, one that also doesn't choke if the directory does not exist at call time.

In the Pig Latin Reference I found the shell command invoker fs. Unfortunately the Pig script breaks whenever anything produces an error. So I can't use

fs -rmr foo/bar

(i.e. remove recursively) since it breaks if the directory doesn't exist. For a moment I thought I might use

fs -test -e foo/bar

which is a test and shouldn't break, or so I thought. However, Pig again interprets test's return code for a non-existent directory as a failure and breaks.

There is a JIRA ticket for the Pig project that addresses my problem and suggests an optional parameter OVERWRITE or FORCE_WRITE for the STORE command. However, I'm using Pig 0.8.1 out of necessity, and it has no such parameter.

226

asked Jun 19 '12 22:06

valid

2 Answers

At last I found a solution on grokbase. Since finding the solution took too long I will reproduce it here and add to it.

Suppose you want to store your output using the statement

STORE Relation INTO 'foo/bar';

Then, to delete the directory, call this at the start of the script:

rmf foo/bar

No ";" or quotation marks are required, since rmf is a Pig utility (Grunt shell) command rather than a Pig Latin statement.

I cannot reproduce it now, but at one point I got an error message (something about missing files), and I can only assume that rmf interfered with map/reduce. I therefore recommend putting the call before any relation declaration. Placing it after the SET, REGISTER and %default lines should be fine.

Example:

SET mapred.fairscheduler.pool 'inhouse';
REGISTER /usr/lib/pig/contrib/piggybank/java/piggybank.jar;
%default name 'foobar'
rmf foo/bar
Rel = LOAD 'something.tsv';
STORE Rel INTO 'foo/bar';
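
A variation on the example above, as a sketch only: the output path is kept in one parameter so that the rmf call and the STORE statement cannot drift apart. The parameter name out is made up for illustration, and the sketch assumes that parameter substitution is applied to the rmf line in the same way as to the rest of the script.

```pig
-- 'out' is a hypothetical parameter name; it can be overridden on the
-- command line with: pig -param out=some/dir script.pig
%default out 'foo/bar'

-- Remove the output path if it exists; rmf does not fail if it is missing.
rmf $out

Rel = LOAD 'something.tsv';
STORE Rel INTO '$out';
```

With this layout the path is written once, so renaming the output directory cannot leave a stale rmf target behind.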

198

answered Sep 26 '22 02:09

valid

Once you use the fs command, there are a lot of ways to do this. For an individual file, I wound up adding this to the beginning of my scripts:

-- Delete file (won't work for output, which will be a directory,
-- but will work for a file that gets copied or moved during
-- the script).
fs -touchz top_100
rm top_100

For a directory:

-- Delete dir
fs -rm -r out
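
As the question describes, a failing fs command aborts the script, so the line above would be expected to stop the run when out does not exist. A sketch of two ways around that follows. The second one assumes the underlying Hadoop release accepts -f on fs -rm, which older releases do not.

```pig
-- Option 1: Pig's own utility command; no error if 'out' is missing.
rmf out

-- Option 2 (assumption: the installed Hadoop supports -f on rm):
-- -f suppresses the error when the path does not exist.
fs -rm -r -f out
```

Only one of the two lines is needed. rmf is the safer choice when the Hadoop version is unknown.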

answered Sep 23 '22 02:09

Todd Nemet
