Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Loading Raw JSON into Pig

I have a file where each line is a JSON object (actually, it's a dump of stackoverflow). I would like to load this into Apache Pig as easily as possible, but I am having trouble figuring out how I can tell Pig what the input format is. Here's an example of an entry,

{ 
"_id" : { "$oid" : "506492073401d91fa7fdffbe" }, 
"Body" : "....", 
"ViewCount" : 7351, 
"LastEditorDisplayName" : "Rich B", 
"Title" : ".....", 
"LastEditorUserId" : 140328, 
"LastActivityDate" : { "$date" : 1314819738077 }, 
"LastEditDate" : { "$date" : 1313882544213 }, 
"AnswerCount" : 12, "CommentCount" : 19, 
"AcceptedAnswerId" : 7, 
"Score" : 83, 
"PostTypeId" : "question", 
"OwnerUserId" : 8, 
"Tags" : [ "c#", "winforms" ], 
"CreationDate" : { "$date" : 1217540572667 }, 
"FavoriteCount" : 13, "Id" : 4, 
"ForumName" : "stackoverflow.com" 
}

Is there a way I can load a file where each line is one of the above into Pig without having to specify the schema by hand? Or perhaps a way to automatically generate a schema based on the (possibly nested) keys observed in all objects? If I do need to specify the schema by hand, what would the schema string look like?

Thanks!

like image 577
duckworthd Avatar asked Sep 28 '12 17:09

duckworthd


People also ask

How do you load data into a pig?

Now load the data from the file student_data. txt into Pig by executing the following Pig Latin statement in the Grunt shell. grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt' USING PigStorage(',') as ( id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray );

Which function is used to read the data in Pig?

Which of the following function is used to read data in PIG? Explanation: PigStorage is the default load function.

What is pig server?

Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs.


1 Answers

The quick and easy way: use Twitter's elephantbird project. Inside is a loader called com.twitter.elephantbird.pig.load.JsonLoader. When used directly like so,

A = LOAD '/path/to/data.json' USING com.twitter.elephantbird.pig.load.JsonLoader() as (json:map[]);
B = FOREACH A GENERATE json#'fieldName' AS field_name;

nested elements won't be loaded. However, you can easily fix that (if desired) by changing it to,

A = LOAD '/path/to/data.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad')

Including elephantbird is easy -- simply pull the the project "elephant-bird" with organization "com.twitter.elephantbird" using Maven (or equivalent's) dependency manager, then issuing the usual register command in pig

register 'lib/elephantbird.jar';
like image 63
duckworthd Avatar answered Oct 25 '22 15:10

duckworthd