
Does throwing an exception in an EvalFunc Pig UDF skip just that line, or stop completely?

I have a User Defined Function (UDF) written in Java that parses lines in a log file and returns the information to Pig, so Pig can do all the processing.

It looks something like this:

public abstract class Foo extends EvalFunc<Tuple> {
    public Foo() {
        super();
    }

    public Tuple exec(Tuple input) throws IOException {
        try {
            // do stuff with input
        } catch (Exception e) {
            throw WrappedIOException.wrap("Error with line", e);
        }
    }
}

My question is: if it throws the IOException, will it stop completely, or will it return results for the rest of the lines that don't throw an exception?

Example: I run this in Pig:

REGISTER myjar.jar;
DEFINE Extractor com.namespace.Extractor();

logs = LOAD '$IN' USING TextLoader AS (line: chararray);
events = FOREACH logs GENERATE FLATTEN(Extractor(line));

With this input:

1.5 7 "Valid Line"
1.3 gghyhtt Inv"alid line"" I throw an exceptioN!!
1.8 10 "Valid Line 2"

Will it process the two valid lines, so that 'events' has 2 tuples, or will it just die in a fire?

Daniel Huckstep asked Mar 29 '10 17:03



1 Answer

If the UDF throws an exception, the task fails and is retried.

Because the same bad line triggers the same exception, it will fail again on each retry (4 attempts in total by default), after which the whole job is marked FAILED.
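
To make the retry behaviour concrete, here is a toy model in plain Java. It is not Hadoop or Pig code: the class `RetrySim`, the `process` method and the rule "a line containing `bad` throws" are all assumptions for illustration. It only shows why retrying does not help when the failure is caused by the data itself.

```java
import java.util.List;

// Toy model of task retries; this is not Hadoop or Pig code.
class RetrySim {
    static final int MAX_ATTEMPTS = 4; // the default number of attempts cited above

    // Hypothetical record processor: throws on any line containing "bad".
    static void process(String line) {
        if (line.contains("bad")) {
            throw new IllegalStateException("Error with line: " + line);
        }
    }

    // Returns true if some attempt processed every line, false if all attempts failed.
    static boolean runTask(List<String> lines) {
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                for (String line : lines) {
                    process(line);
                }
                return true; // the whole input was processed
            } catch (IllegalStateException e) {
                // Each retry re-reads the same input, so it hits the same line again.
                System.err.println("attempt " + attempt + " failed: " + e.getMessage());
            }
        }
        return false; // every attempt failed, so the job as a whole fails
    }
}
```

With input that contains a bad line, `runTask` returns false after four identical failures; with clean input it succeeds on the first attempt.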

If you want to log the error without stopping the job, you can return null:

public Tuple exec(Tuple input) throws IOException {
    try {
        // do stuff with input
    } catch (Exception e) {
        System.err.println("Error with ...");
        return null;
    }
}
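
As a minimal sketch of what the "do stuff with input" part might look like for the sample lines in the question, the plain-Java helper below parses one line and returns null for a malformed one instead of throwing. It deliberately avoids the Pig classes so it can be run on its own. The names `LineParser` and `Event`, the field names and the regular expression are assumptions; a real UDF would build a Pig Tuple rather than an `Event`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the body of exec(): parse one log line, or return null.
class LineParser {
    // Assumed line format: <decimal> <integer> "<quoted text>"
    private static final Pattern LINE =
            Pattern.compile("^(\\d+\\.\\d+) (\\d+) \"([^\"]*)\"$");

    // Simple value holder standing in for a Pig Tuple.
    static final class Event {
        final double first;
        final int second;
        final String text;

        Event(double first, int second, String text) {
            this.first = first;
            this.second = second;
            this.text = text;
        }
    }

    static Event parse(String line) {
        try {
            Matcher m = LINE.matcher(line);
            if (!m.matches()) {
                System.err.println("Error with line: " + line);
                return null; // malformed line: log it and signal "no result"
            }
            return new Event(Double.parseDouble(m.group(1)),
                             Integer.parseInt(m.group(2)),
                             m.group(3));
        } catch (RuntimeException e) {
            // e.g. NumberFormatException on overflow, NullPointerException on null input
            System.err.println("Error with line: " + line);
            return null;
        }
    }
}
```

Applied to the question's input, `parse` returns an `Event` for the first and third lines and null for the second.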

And filter them later in Pig:

events_all = FOREACH logs GENERATE Extractor(line) AS line;
events_valid = FILTER events_all by line IS NOT null;
events = FOREACH events_valid GENERATE FLATTEN(line);

In your example, the output will contain only the two valid lines. Be careful with this approach, though: the error appears only in the logs and won't fail your job!
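
Since returning null makes bad lines disappear quietly, it can help to keep a count of what was dropped. The sketch below models the "apply the function, then filter out nulls" step in plain Java and counts the discarded inputs. The class `NullFilter`, its methods and the idea of a tolerated fraction are assumptions for illustration, not part of Pig.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical helper: apply a null-returning parser, keep the non-null results,
// and count how many inputs were dropped.
class NullFilter<T> {
    private final List<T> valid = new ArrayList<>();
    private int dropped = 0;

    NullFilter(List<String> lines, Function<String, T> parser) {
        for (String line : lines) {
            T parsed = parser.apply(line);   // plays the role of Extractor(line)
            if (parsed == null) {
                dropped++;                   // would otherwise vanish silently
            } else {
                valid.add(parsed);           // plays the role of FILTER ... IS NOT null
            }
        }
    }

    List<T> valid() { return valid; }

    int dropped() { return dropped; }

    // True when more than maxFraction of the input was dropped.
    boolean exceeds(double maxFraction) {
        int total = valid.size() + dropped;
        return total > 0 && (double) dropped / total > maxFraction;
    }
}
```

A caller could check `exceeds(...)` after processing and raise an alarm when too many lines were discarded, rather than finding out from the logs later.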

Reply to comment #1:

Actually, the whole resulting tuple would be null (so there are no fields inside).

For example if your schema has 3 fields:

 events_all = FOREACH logs
              GENERATE Extractor(line) AS line:tuple(a:int,b:int,c:int);

and some lines are invalid, we would get:

 ()
 ((1,2,3))
 ((1,2,3))
 ()
 ((1,2,3))

And if you don't filter out the null lines and then try to access a field, you get a java.lang.NullPointerException:

events = FOREACH events_all GENERATE line.a;

Romain answered Sep 28 '22 00:09
