I ran into an issue loading a set of JSON documents into Pig. I have a lot of JSON documents that vary in the fields they contain. The fields I need are present in most documents, and where they are missing I would like to get a null value.
I just downloaded and compiled the latest Pig version (0.12, straight from the Apache git repository) to make sure this hasn't been fixed already.
What I have is a json document like this:
{"foo":1,"bar":2,"baz":3}
When I load this into Pig using this:
Json1 = LOAD 'test.json' USING JsonLoader('foo:int,bar:int,baz:int');
DESCRIBE Json1;
DUMP Json1;
I get the expected results
Json1: {foo: int,bar: int,baz: int}
(1,2,3)
However, when the fields are listed in a different order in the schema:
Json2 = LOAD 'test.json' USING JsonLoader('baz:int,bar:int,foo:int');
DESCRIBE Json2;
DUMP Json2;
I get an undesired result:
Json2: {baz: int,bar: int,foo: int}
(1,2,3)
That should have been
(3,2,1)
Apparently the field names in the schema definition have nothing to do with the field names in the JSON; the values seem to be assigned by position.
What I need is to load specific fields from a JSON file (with embedded documents!) into Pig.
How do I resolve this?
I think this is a known issue with even the latest version of Pig, so there isn't an easy way around this other than to use a more capable JsonLoader.
Use the Elephant Bird JsonLoader instead, which behaves the way you expect: it loads each record as a map, so fields are looked up by name rather than by their position in the document.
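A minimal sketch of what that could look like, assuming the Elephant Bird jars and json-simple are available locally. The jar file names below are placeholders and depend on the version you build or download. The loader returns each input line as a single map, so you pick fields out by key, and a key that is absent from a document comes back as null:

```pig
-- Jar names are placeholders; adjust paths and versions to what you actually have.
REGISTER 'json-simple.jar';
REGISTER 'elephant-bird-hadoop-compat.jar';
REGISTER 'elephant-bird-core.jar';
REGISTER 'elephant-bird-pig.jar';

-- Each line of test.json becomes one map.
-- The -nestedLoad option also turns embedded documents into nested maps.
Json3 = LOAD 'test.json'
        USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad')
        AS (json:map[]);

-- Fields are looked up by name, so the order here is independent of the
-- order in the JSON. Map values are untyped, so cast them explicitly.
Fields = FOREACH Json3 GENERATE
             (int) json#'baz' AS baz,
             (int) json#'bar' AS bar,
             (int) json#'foo' AS foo;

DESCRIBE Fields;
DUMP Fields;
```

For embedded documents, the map lookups can be chained, e.g. json#'outer'#'inner', where outer and inner are hypothetical field names used only for illustration.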