Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Apache Pig: strip namespace prefix (::) after group operation

Tags:

apache-pig

A common pattern in my data processing is to group by some set of columns, apply a filter, then flatten again. For example:

my_data_grouped = group my_data by some_column;
my_data_grouped = filter my_data_grouped by <some expression>;
my_data = foreach my_data_grouped flatten(my_data);

The problem here is that if my_data starts with a schema like (c1, c2, c3) after this operation it will have a schema like (mydata::c1, mydata::c2, mydata::c3). Is there a way to easily strip off the "mydata::" prefix if the columns are unique?

I know I can do something like this:

my_data = foreach my_data generate c1 as c1, c2 as c2, c3 as c3;

However that gets awkward and hard to maintain for data sets with lots of columns and is impossible for data sets with variable columns.

like image 547
Nick Avatar asked Jun 11 '12 22:06

Nick


2 Answers

If all fields in a schema have the same set of prefixes (e.g. group1::id, group1::amount, etc) you can ignore the prefix when referencing specific fields (and just reference them as id, amount, etc)

Alternatively, if you're still looking to strip a schema of a single level of prefixing you can use a UDF like this:

public class RemoveGroupFromTupleSchema extends EvalFunc<Tuple> {

@Override
public Tuple exec(Tuple input) throws IOException {
    Tuple result = input;
    return result;
}


@Override
public Schema outputSchema(Schema input) throws FrontendException {
    if(input.size() != 1) {
        throw new RuntimeException("Expected input (tuple) but input does not have 1 field");
    }

    List<Schema.FieldSchema> inputSchema = input.getFields();
    List<Schema.FieldSchema> outputSchema = new ArrayList<Schema.FieldSchema>(inputSchema);
    for(int i = 0; i < inputSchema.size(); i++) {
        Schema.FieldSchema thisInputFieldSchema = inputSchema.get(i);
        String inputFieldName = thisInputFieldSchema.alias;
        Byte dataType = thisInputFieldSchema.type;

        String outputFieldName;
        int findLoc = inputFieldName.indexOf("::");
        if(findLoc == -1) {
            outputFieldName = inputFieldName;
        }
        else {
            outputFieldName = inputFieldName.substring(findLoc+2);
        }
        Schema.FieldSchema thisOutputFieldSchema = new Schema.FieldSchema(outputFieldName, dataType);
        outputSchema.set(i, thisOutputFieldSchema);
    }

    return new Schema(outputSchema);
}
}
like image 73
JDag Avatar answered Nov 02 '22 14:11

JDag


You can put the 'AS' statement on the same line as the 'foreach'.

i.e.

my_data_grouped = group my_data by some_column;
my_data_grouped = filter my_data_grouped by <some expression>;
my_data = FOREACH my_data_grouped FLATTEN(my_data) AS (c1, c2, c3);

However, this is just the same as doing it on 2 lines, and does not alleviate your issue for 'data sets with variable columns'.

like image 20
barryred Avatar answered Nov 02 '22 16:11

barryred