
Loading JSON with varying schema into Pig

I ran into an issue loading a set of JSON documents into Pig. I have a lot of JSON documents that vary in the fields they contain. The fields I need are present in most documents, and where they are missing I would like to get a null value.

I just downloaded and compiled the latest Pig version (0.12, straight from the Apache git repository) to make sure this had not already been fixed.

What I have is a json document like this:

{"foo":1,"bar":2,"baz":3}

When I load this into Pig with:

Json1 = LOAD 'test.json' USING JsonLoader('foo:int,bar:int,baz:int');
DESCRIBE Json1;
DUMP Json1;

I get the expected results

Json1: {foo: int,bar: int,baz: int}
(1,2,3)

However, when the fields are listed in a different order in the schema:

Json2 = LOAD 'test.json' USING JsonLoader('baz:int,bar:int,foo:int');
DESCRIBE Json2;
DUMP Json2;

I get an undesired result:

Json2: {baz: int,bar: int,foo: int}
(1,2,3)

That should have been

(3,2,1)

Apparently the field names in the schema definition are not matched against the field names in the JSON; the values are assigned by position.

What I need is to load specific fields from a JSON file (with embedded documents!) into Pig.

How do I resolve this?

Niels Basjes asked Mar 13 '13 21:03


1 Answer

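I think this is a known issue, even in the latest version of Pig: the built-in JsonLoader assigns values to the schema by position, not by field name. There isn't an easy way around this other than to use a more capable JSON loader.

Use the Elephant Bird JsonLoader instead, which behaves the way you expect: fields are looked up by name, so the order in which you list them does not have to match the order in the JSON.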
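As a rough sketch of what that could look like (not tested against your data): the Elephant Bird loader returns each document as a Pig map, and you then pull out fields by key. The jar paths below are placeholders that depend on your installation and Elephant Bird version, and the relation names are only illustrative. The `-nestedLoad` option is there because you mentioned embedded documents.

```pig
-- Placeholder jar paths: adjust to wherever the Elephant Bird jars
-- (and the json-simple jar they depend on) live on your system.
REGISTER '/path/to/json-simple.jar';
REGISTER '/path/to/elephant-bird-hadoop-compat.jar';
REGISTER '/path/to/elephant-bird-core.jar';
REGISTER '/path/to/elephant-bird-pig.jar';

-- Each JSON document is loaded as a single map; -nestedLoad keeps
-- embedded documents as nested maps instead of flattening them to strings.
Raw = LOAD 'test.json'
      USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad')
      AS (json:map[]);

-- Fields are selected by key, so the order here is whatever you choose.
-- A key that is absent from a document yields null.
Json2 = FOREACH Raw GENERATE
            json#'baz' AS baz,
            json#'bar' AS bar,
            json#'foo' AS foo;

DESCRIBE Json2;
DUMP Json2;
```

The values come out of the map untyped, so you may need to add explicit casts (for example to int) in the FOREACH, depending on how your Elephant Bird version represents numbers.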

seedhead answered Sep 28 '22 08:09