Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to load a TSV containing a JSON field in Pig Latin?

I am trying to load a file with a schema that is primarily TSV (tab separated values), but one of the fields is a JSON value. It seems that pig latin has TextLoader for tab (or other) separated values, and JsonLoader for JSON...

In particular, the each row of data looks like:

date\tevent_name\tevent_details\n

where event_details is a JSON formatted string. The others are simply char arrays.

What is the easiest way to load this data?

Notes: i'm using Pig ver 0.11.1.

like image 294
mbells Avatar asked Dec 11 '22 12:12

mbells


2 Answers

(After doing a bunch of research, here is the answer:)

Download the necessary libraries from http://mvnrepository.com/ needed for the register commands.

The pig script would be as follows:

register 'libs/elephant-bird-core-4.1.jar';
register 'libs/elephant-bird-pig-4.1.jar';
register 'libs/guava-14.0.1.jar';
register 'libs/json-simple-1.1.1.jar';
register 'libs/piggybank.jar';

define decode_json com.twitter.elephantbird.pig.piggybank.JsonStringToMap();

e1 = load '$filename' using PigStorage() as (
    date: chararray,
    event_name: chararray,
    event_details_str: chararray,
);

-- Remove the header row:
e2 = filter e1 by not date matches '.*DATE';

-- Convert the event_details from a JSON string to a map:
events = foreach e2 generate *, decode_json(event_details_str) as event_details;
like image 139
mbells Avatar answered Dec 30 '22 03:12

mbells


mbells answer works fine, one thing I was struggling with was how to retrieve the map values. Below is an example of retrieving key1, key2 from events map.

fields = FOREACH events GENERATE events#'key1', events#'key2';
like image 29
Richipal Avatar answered Dec 30 '22 05:12

Richipal