I am trying to load a file with a schema that is primarily TSV (tab separated values), but one of the fields is a JSON value. It seems that pig latin has TextLoader for tab (or other) separated values, and JsonLoader for JSON...
In particular, the each row of data looks like:
date\tevent_name\tevent_details\n
where event_details is a JSON formatted string. The others are simply char arrays.
What is the easiest way to load this data?
Notes: i'm using Pig ver 0.11.1.
(After doing a bunch of research, here is the answer:)
Download the necessary libraries from http://mvnrepository.com/ needed for the register commands.
The pig script would be as follows:
register 'libs/elephant-bird-core-4.1.jar';
register 'libs/elephant-bird-pig-4.1.jar';
register 'libs/guava-14.0.1.jar';
register 'libs/json-simple-1.1.1.jar';
register 'libs/piggybank.jar';
define decode_json com.twitter.elephantbird.pig.piggybank.JsonStringToMap();
e1 = load '$filename' using PigStorage() as (
date: chararray,
event_name: chararray,
event_details_str: chararray,
);
-- Remove the header row:
e2 = filter e1 by not date matches '.*DATE';
-- Convert the event_details from a JSON string to a map:
events = foreach e2 generate *, decode_json(event_details_str) as event_details;
mbells answer works fine, one thing I was struggling with was how to retrieve the map values. Below is an example of retrieving key1, key2 from events map.
fields = FOREACH events GENERATE events#'key1', events#'key2';
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With