Apache Pig error 2218 caused by changing to Pig version 0.10.0

Tags: apache-pig

I'm at my wits' end trying to solve this one. I have scripts and UDFs that run perfectly with Pig 0.8.1, but when I run them with Pig 0.10.0, I get:

ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2218: Invalid resource schema: bag schema  must have tuple as its field

The code that calls the UDF from the Pig script looks like this:

    parsed = LOAD '$INPUT' 
    USING pignlproc.storage.ParsingWikipediaLoader('$LANG')
    AS (title, id, pageUrl, text, redirect, links, headers, paragraphs);

The ParsingWikipediaLoader class implements LoadMetadata, and its getSchema() method looks like this:

    public ResourceSchema getSchema(String location, Job job)
            throws IOException {
        Schema schema = new Schema();
        schema.add(new FieldSchema("title", DataType.CHARARRAY));
        schema.add(new FieldSchema("id", DataType.CHARARRAY));
        schema.add(new FieldSchema("uri", DataType.CHARARRAY));
        schema.add(new FieldSchema("text", DataType.CHARARRAY));
        schema.add(new FieldSchema("redirect", DataType.CHARARRAY));

        Schema linkInfoSchema = new Schema();
        linkInfoSchema.add(new FieldSchema("target", DataType.CHARARRAY));
        linkInfoSchema.add(new FieldSchema("begin", DataType.INTEGER));
        linkInfoSchema.add(new FieldSchema("end", DataType.INTEGER));
        schema.add(new FieldSchema("links", linkInfoSchema, DataType.BAG));

        Schema headerInfoSchema = new Schema();
        headerInfoSchema.add(new FieldSchema("tagname", DataType.CHARARRAY));
        headerInfoSchema.add(new FieldSchema("begin", DataType.INTEGER));
        headerInfoSchema.add(new FieldSchema("end", DataType.INTEGER));
        schema.add(new FieldSchema("headers", headerInfoSchema, DataType.BAG));

        Schema paragraphInfoSchema = new Schema();
        paragraphInfoSchema.add(new FieldSchema("tagname", DataType.CHARARRAY));
        paragraphInfoSchema.add(new FieldSchema("begin", DataType.INTEGER));
        paragraphInfoSchema.add(new FieldSchema("end", DataType.INTEGER));
        schema.add(new FieldSchema("paragraphs", paragraphInfoSchema,
                DataType.BAG));

        return new ResourceSchema(schema);
    }

Again, the script and UDF work as expected with Pig 0.8.1, so this has to be some difference between the versions. I've searched thoroughly, but can't find anything about this in the documentation, or on Stack Overflow.

asked Nov 17 '25 03:11 by chokamp


1 Answer

Looks like the difference is in the ResourceFieldSchema constructor.

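In 0.8.1 the constructor detects a bag and wraps its inner schema in a tuple automatically; that logic was removed in 0.10.0 (both versions of the constructor are quoted at the end of this answer). You therefore need to amend your schema definition to wrap each bag schema in a tuple yourself: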

    schema.add(new FieldSchema("links", new Schema(
            new FieldSchema("t", linkInfoSchema)), DataType.BAG));

However, on 0.8.1 this produces a tuple-in-tuple schema, because that version still adds its own wrapper:

  • 0.10.0: {title: chararray,id: chararray,uri: chararray,text: chararray,redirect: chararray,links: {t: (target: chararray,begin: int,end: int)},headers: {t: (tagname: chararray,begin: int,end: int)},paragraphs: {t: (tagname: chararray,begin: int,end: int)}}
  • 0.8.1: {title: chararray,id: chararray,uri: chararray,text: chararray,redirect: chararray,links: {t: (t: (target: chararray,begin: int,end: int))},headers: {t: (t: (tagname: chararray,begin: int,end: int))},paragraphs: {t: (t: (tagname: chararray,begin: int,end: int))}}

You can fix this by setting the two-level access required flag to true on the wrapper schema (shown here for links; headers and paragraphs need the same change):

    Schema linkInfoSchema = new Schema();
    linkInfoSchema.add(new FieldSchema("target", DataType.CHARARRAY));
    linkInfoSchema.add(new FieldSchema("begin", DataType.INTEGER));
    linkInfoSchema.add(new FieldSchema("end", DataType.INTEGER));
    Schema linkInfoSchemaTupleWrapper = new Schema(new FieldSchema("t",
            linkInfoSchema));
    linkInfoSchemaTupleWrapper.setTwoLevelAccessRequired(true);
    schema.add(new FieldSchema("links", linkInfoSchemaTupleWrapper, DataType.BAG));

This produces an identical schema on 0.10.0 and 0.8.1:

  • 0.10.0: {title: chararray,id: chararray,uri: chararray,text: chararray,redirect: chararray,links: {t: (target: chararray,begin: int,end: int)},headers: {t: (tagname: chararray,begin: int,end: int)},paragraphs: {t: (tagname: chararray,begin: int,end: int)}}
  • 0.8.1: {title: chararray,id: chararray,uri: chararray,text: chararray,redirect: chararray,links: {t: (target: chararray,begin: int,end: int)},headers: {t: (tagname: chararray,begin: int,end: int)},paragraphs: {t: (tagname: chararray,begin: int,end: int)}}

For reference, the ResourceFieldSchema constructor in 0.10.0:

    /**
     * Construct using a {@link org.apache.pig.impl.logicalLayer.schema.Schema.FieldSchema} as the template.
     * @param fieldSchema fieldSchema to copy from
     */
    public ResourceFieldSchema(FieldSchema fieldSchema) {
        type = fieldSchema.type;
        name = fieldSchema.alias;
        description = "autogenerated from Pig Field Schema";
        Schema inner = fieldSchema.schema;

        // allow partial schema 
        if ((type == DataType.BAG || type == DataType.TUPLE || type == DataType.MAP)
                && inner != null) {
            schema = new ResourceSchema(inner);
        } else {
            schema = null;
        }
    }

The same constructor in 0.8.1:

    /**
     * Construct using a {@link org.apache.pig.impl.logicalLayer.schema.Schema.FieldSchema} as the template.
     * @param fieldSchema fieldSchema to copy from
     */
    public ResourceFieldSchema(FieldSchema fieldSchema) {
        type = fieldSchema.type;
        name = fieldSchema.alias;
        description = "autogenerated from Pig Field Schema";
        Schema inner = fieldSchema.schema;
        if (type == DataType.BAG && fieldSchema.schema != null
                && !fieldSchema.schema.isTwoLevelAccessRequired()) { 
            log.info("Insert two-level access to Resource Schema");
            FieldSchema fs = new FieldSchema("t", fieldSchema.schema);
            inner = new Schema(fs);                
        }

        // allow partial schema 
        if ((type == DataType.BAG || type == DataType.TUPLE)
                && inner != null) {
            schema = new ResourceSchema(inner);
        } else {
            schema = null;
        }
    }
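
The difference between the two constructors above can be reproduced with a small stand-alone model. This is only a sketch: `BagSchemaModel`, `Field`, `convert081` and `convert0100` are hypothetical names and do not use Pig's own classes. The check that surfaces as ERROR 2218 is folded into `convert0100` for brevity; where Pig 0.10.0 actually performs it is an assumption not shown in the source quoted above.

```java
import java.util.Arrays;
import java.util.List;

public class BagSchemaModel {

    /** Toy stand-in for a Pig field (NOT Pig's FieldSchema). */
    public static final class Field {
        final String name;
        final String type;            // "chararray", "int", "tuple" or "bag"
        final boolean twoLevelAccess; // mirrors isTwoLevelAccessRequired()
        final List<Field> inner;      // empty for simple types

        Field(String name, String type, boolean twoLevelAccess, List<Field> inner) {
            this.name = name;
            this.type = type;
            this.twoLevelAccess = twoLevelAccess;
            this.inner = inner;
        }
    }

    public static Field simple(String name, String type) {
        return new Field(name, type, false, Arrays.<Field>asList());
    }

    public static Field tuple(String name, List<Field> inner) {
        return new Field(name, "tuple", false, inner);
    }

    public static Field bag(String name, boolean twoLevelAccess, List<Field> inner) {
        return new Field(name, "bag", twoLevelAccess, inner);
    }

    /** Models 0.8.1: a bag without the flag gets an extra tuple "t" inserted. */
    public static Field convert081(Field f) {
        if (f.type.equals("bag") && !f.inner.isEmpty() && !f.twoLevelAccess) {
            return bag(f.name, true, Arrays.asList(tuple("t", f.inner)));
        }
        return f;
    }

    /** Models 0.10.0: nothing is inserted; a bag must already hold one tuple. */
    public static Field convert0100(Field f) {
        boolean singleTuple = f.inner.size() == 1
                && f.inner.get(0).type.equals("tuple");
        if (f.type.equals("bag") && !f.inner.isEmpty() && !singleTuple) {
            throw new IllegalArgumentException(
                    "bag schema must have tuple as its field");
        }
        return f;
    }

    /** Renders a field in a Pig-like notation: {...} for bags, (...) for tuples. */
    public static String describe(Field f) {
        if (f.inner.isEmpty()) {
            return f.name + ": " + f.type;
        }
        StringBuilder sb = new StringBuilder();
        for (Field child : f.inner) {
            if (sb.length() > 0) {
                sb.append(',');
            }
            sb.append(describe(child));
        }
        boolean isBag = f.type.equals("bag");
        return f.name + ": " + (isBag ? "{" : "(") + sb + (isBag ? "}" : ")");
    }

    public static void main(String[] args) {
        List<Field> cols = Arrays.asList(simple("target", "chararray"),
                simple("begin", "int"), simple("end", "int"));

        // Wrapper tuple added by hand, flag left unset: 0.8.1 wraps a second time.
        Field noFlag = bag("links", false, Arrays.asList(tuple("t", cols)));
        System.out.println("0.8.1,  no flag: " + describe(convert081(noFlag)));
        System.out.println("0.10.0, no flag: " + describe(convert0100(noFlag)));

        // Wrapper tuple plus flag: both models leave the schema untouched.
        Field withFlag = bag("links", true, Arrays.asList(tuple("t", cols)));
        System.out.println("0.8.1,  flag:    " + describe(convert081(withFlag)));
        System.out.println("0.10.0, flag:    " + describe(convert0100(withFlag)));
    }
}
```

In this model, the unwrapped bag from the question is rejected by `convert0100`, the hand-wrapped bag without the flag comes out of `convert081` as `{t: (t: (...))}`, and the hand-wrapped bag with the flag set is identical under both, which matches the schemas listed earlier in this answer.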

answered Nov 20 '25 04:11 by Chris White


